CN112270920A - Voice synthesis method and device, electronic equipment and readable storage medium - Google Patents

Voice synthesis method and device, electronic equipment and readable storage medium Download PDF

Info

Publication number
CN112270920A
CN112270920A (application number CN202011169694.3A)
Authority
CN
China
Prior art keywords
voice
text
style
text segment
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011169694.3A
Other languages
Chinese (zh)
Inventor
袁俊
陈昌滨
王俊超
聂志朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011169694.3A priority Critical patent/CN112270920A/en
Publication of CN112270920A publication Critical patent/CN112270920A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a speech synthesis method and apparatus, an electronic device and a readable storage medium, and relates to the technical fields of speech processing and deep learning. The scheme adopted for speech synthesis is as follows: acquiring a text to be processed; determining a plurality of text segments contained in the text to be processed, together with the voice style and voice timbre corresponding to each text segment; converting each text segment into an audio segment according to its corresponding voice style and voice timbre; and splicing the converted audio segments to obtain a speech synthesis result corresponding to the text to be processed. The method and the device improve the pronunciation diversity of different roles in the speech synthesis result and enhance its stylistic expressiveness, so that the speech synthesis result is more realistic and vivid.

Description

Voice synthesis method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for speech synthesis, an electronic device, and a readable storage medium in the field of speech processing and deep learning technologies.
Background
At present, a large number of apps and software products with audio playing functions, such as audiobook and storytelling applications, exist on the market. Part of the audio resources in such apps or software are synthesized from corresponding texts using speech synthesis technology. In these scenarios, speech synthesis can greatly reduce the labor cost of recording the audio resources.
However, the audio generated by existing speech synthesis technology is stereotyped and monotonous; its stylistic expressiveness and sense of rhythm differ greatly from those of real people, especially professional voice actors, and the user experience is poor. In addition, existing schemes use fixed timbres when synthesizing multi-role dialogue texts and cannot give users a personalized multi-role-playing experience. In a personalized synthesis scenario, a user wants to synthesize audio with any given timbre, but existing technical solutions cannot achieve professional-dubbing-like prosody and arbitrary timbre output at the same time. In summary, there is no technical solution that lets users synthesize multi-role dialogue texts with a plurality of arbitrary timbres while ensuring that the synthesized audio has prosody similar to professional dubbing.
Disclosure of Invention
To solve the above technical problem, the present application provides a speech synthesis method, including: acquiring a text to be processed; determining a plurality of text segments contained in the text to be processed, and a voice style and a voice timbre respectively corresponding to each text segment; converting each text segment into an audio segment according to the voice style and the voice timbre corresponding to each text segment; and splicing the converted audio segments to obtain a speech synthesis result corresponding to the text to be processed.
The present application further provides a speech synthesis apparatus, including: an acquiring unit configured to acquire a text to be processed; a determining unit configured to determine a plurality of text segments contained in the text to be processed, and a voice style and a voice timbre respectively corresponding to each text segment; a synthesizing unit configured to convert each text segment into an audio segment according to the voice style and the voice timbre corresponding to each text segment; and a processing unit configured to splice the converted audio segments to obtain a speech synthesis result corresponding to the text to be processed.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above method.
A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the above method.
One embodiment in the above application has the following advantages or benefits: the application can obtain a speech synthesis result by pairing diversified voice styles with arbitrary timbres, so that the speech synthesis result is richer in prosody and expressiveness, and its authenticity and vividness are improved. Because the technical means of acquiring the voice styles and voice timbres corresponding to different text segments in the text to be processed is adopted, the technical problem in the prior art that synthesized audio resources are monotonous because a fixed style or fixed timbre is used for different speech segments is solved; the pronunciation diversity of different roles during speech synthesis is improved, the stylistic expressiveness of the speech synthesis result is enhanced, and the speech synthesis result becomes more realistic and vivid.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present application;
FIG. 2 is a schematic diagram according to a second embodiment of the present application;
FIG. 3 is a schematic illustration according to a third embodiment of the present application;
fig. 4 is a block diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding; these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present application. As shown in fig. 1, the speech synthesis method of this embodiment may specifically include the following steps:
s101, acquiring a text to be processed;
s102, determining a plurality of text segments contained in the text to be processed, and respectively corresponding to the voice style and voice tone of each text segment;
s103, converting each text segment into an audio segment according to the corresponding voice style and voice tone of each text segment;
and S104, splicing the audio segments obtained by conversion to obtain a voice synthesis result corresponding to the text to be processed.
According to the voice synthesis method, the text segments are respectively converted into the audio segments according to the voice styles and voice timbres corresponding to the different text segments in the text to be processed, and then the audio segments are spliced to obtain the voice synthesis result, so that the pronunciation diversity of different roles in the voice synthesis result is improved, the style expressive force of the voice synthesis result is enhanced, and the voice synthesis result is more real and vivid.
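As a rough illustration of this four-step flow, the following is a minimal, self-contained Python sketch. Every helper here is a toy placeholder (the splitting rule, emotion rule, role rule and the silent stand-in synthesizer are all assumptions, not part of the patent); the paragraphs below describe how each step is actually realized.

```python
import re
from typing import List

import numpy as np


def split_into_segments(text: str) -> List[str]:
    # Toy S102 splitter: one segment per sentence.
    return [s for s in re.split(r"(?<=[。！？!?.])\s*", text.strip()) if s]


def infer_style(segment: str) -> str:
    # Toy emotion-word rule standing in for the style determination of S102.
    return "happy" if "laugh" in segment else "neutral"


def infer_timbre(segment: str) -> str:
    # Toy role rule standing in for the timbre determination of S102.
    return "steady_male" if "father" in segment else "narrator"


def synthesize_segment(segment: str, style: str, timbre: str,
                       sr: int = 24000) -> np.ndarray:
    # Stand-in for S103; a real system would run the models described below.
    return np.zeros(sr // 2, dtype=np.float32)


def synthesize(text: str) -> np.ndarray:
    # S101-S104: acquire text, tag each segment with style/timbre, convert, splice.
    segments = split_into_segments(text)
    clips = [synthesize_segment(s, infer_style(s), infer_timbre(s))
             for s in segments]
    return np.concatenate(clips)
```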
The text to be processed obtained in S101 may be a text containing multi-role dialogue and narration, such as a novel or a story text. The present embodiment does not limit the type of the text to be processed acquired in S101.
After the text to be processed is acquired in S101, S102 is executed to determine a plurality of text segments contained in the text to be processed and the voice style and voice timbre respectively corresponding to each text segment, that is, to determine the different voice styles and different voice timbres of the different text segments.
In this embodiment, when S102 is executed to determine the plurality of text segments contained in the text to be processed, the text to be processed may be split in units of sentences, so that each split sentence serves as one of the text segments contained in the text to be processed, and each resulting text segment corresponds to a character's line in the text or to narration in the text.
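One way to make each segment line up with either narration or a character's quoted line, as described above, is to split on quotation marks as well as sentence-final punctuation. The sketch below is an assumption for illustration (the patent only states that splitting is done in units of sentences), and the particular quote characters handled are arbitrary.

```python
import re
from typing import List


def split_into_segments(text: str) -> List[str]:
    # First isolate quoted dialogue (“…” or 「…」) as standalone segments, then
    # split the remaining narration on sentence-final punctuation.
    segments: List[str] = []
    for chunk in re.split(r"(“[^”]*”|「[^」]*」)", text):
        if not chunk.strip():
            continue
        if chunk.startswith(("“", "「")):
            segments.append(chunk.strip())          # a character's quoted line
        else:
            segments.extend(s.strip() for s in
                            re.split(r"(?<=[。！？!?.])", chunk) if s.strip())
    return segments
```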
The voice style determined in S102 is used to represent the emotion type of the text segment, which may include types such as neutral (ordinary), happy, angry, sad, fearful, confused, respectful, sarcastic, and so on; the determined voice timbre is used to represent the speaking role of the text segment, such as the narrator (voice-over) or a specific character.
It is understood that the voice timbre determined in S102 may correspond to a specific person in the text to be processed, for example, the name of a character in a novel; it may also correspond to the occupation or identity of a specific person in the text, such as teacher, doctor, father, or mother.
Specifically, when S102 is executed to determine the voice style corresponding to each text segment, an optional implementation is: extracting the emotion words in each text segment, such as words like "happy", "angry" or "sad" appearing in the segment; and determining the emotion type corresponding to the extracted emotion word as the voice style of that text segment. For example, the emotion type corresponding to the emotion word "delighted" is "happy", and the emotion type corresponding to the emotion word "furious" is "angry".
In this embodiment, the voice style determined in S102 is executed, that is, the emotion to be embodied by the audio segment in the voice synthesis result, and by determining the voice style of the text segment, the generated audio segment can more accurately embody the emotion change of the character role, so that the vividness of the voice synthesis result is improved.
It can be understood that one emotion word corresponds to a unique emotion type, and one emotion type may correspond to a plurality of emotion words, so that each text segment can obtain a corresponding voice style under the condition that voice style redundancy is avoided as much as possible.
In addition, if no emotion word can be extracted from a text segment when executing S102, for example because the text segment is narration, the voice style of that text segment is set to a default voice style, for example a "normal" voice style, i.e. a voice style without emotional change.
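A dictionary-lookup sketch of this emotion-word rule is shown below; the lexicon and the word-to-type mapping are illustrative assumptions, since the patent does not publish a concrete emotion vocabulary.

```python
# Hypothetical emotion lexicon mapping emotion words to emotion types; several
# words may map to the same type, but each word maps to exactly one type.
EMOTION_LEXICON = {
    "happy": "happy", "delighted": "happy", "overjoyed": "happy",
    "angry": "angry", "furious": "angry",
    "sad": "sad", "upset": "sad",
}


def infer_style(segment: str, default: str = "normal") -> str:
    # The first emotion word found in the segment decides its voice style;
    # narration or segments without emotion words fall back to the default
    # ("normal") style, i.e. a style without emotional change.
    for word, emotion_type in EMOTION_LEXICON.items():
        if word in segment:
            return emotion_type
    return default
```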
Specifically, when S102 is executed to determine the voice timbre corresponding to each text segment, an optional implementation is: extracting the role identifier in each text segment, and determining the voice timbre corresponding to each text segment according to the extracted role identifier; for example, the extracted role identifier may be "father", "mother", "teacher", "Zhang San", or the like.
In this embodiment, when S102 is executed to determine the voice timbre corresponding to each text segment according to the extracted character identifier, the voice timbre may be obtained according to a preset character-timbre correspondence table.
In addition, when S102 is executed to determine the voice timbre corresponding to each text segment according to the extracted character identifier, the embodiment may adopt an optional implementation manner as follows: acquiring audio data corresponding to the extracted character identifier, which is input by a user; and extracting the tone in the acquired audio data as the voice tone corresponding to each text segment.
That is to say, in the embodiment, the voice timbre corresponding to each text segment can be determined through the audio data input by the user in real time, so that the text segment can be converted into an audio segment with any timbre according to the requirement of the user, thereby further enhancing the user participation during voice synthesis, and realizing personalized playing of multiple roles during voice synthesis.
In addition, if no role identifier can be extracted from a text segment when executing S102, for example because the text segment is narration, the voice timbre of that text segment is set to a default voice timbre, for example a narrator (voice-over) timbre.
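The role-to-timbre logic above can be sketched as a table lookup with two fallbacks: user-supplied reference audio for a role takes precedence, and segments without a recognizable role identifier (e.g. narration) get the default narrator timbre. The table contents and the idea of representing a user-supplied timbre by its audio path are assumptions for illustration.

```python
from typing import Dict, Optional

# Hypothetical preset role-to-timbre correspondence table.
ROLE_TIMBRE_TABLE: Dict[str, str] = {
    "father": "steady_male",
    "mother": "gentle_female",
    "teacher": "bright_female",
}


def infer_timbre(segment: str,
                 user_reference_audio: Optional[Dict[str, str]] = None,
                 default: str = "narrator") -> str:
    # Look for a known role identifier inside the text segment.
    for role, timbre in ROLE_TIMBRE_TABLE.items():
        if role in segment:
            # If the user supplied reference audio for this role, the timbre is
            # taken from that recording (in practice via a speaker representation
            # extracted from the audio); here the file path stands in for it.
            if user_reference_audio and role in user_reference_audio:
                return user_reference_audio[role]
            return timbre
    return default  # narration and unrecognized roles use the default timbre
```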
In order to further improve the efficiency of speech synthesis, when S102 is executed to determine the voice style and voice timbre respectively corresponding to each text segment, an optional implementation is: inputting each text segment into a pre-trained text analysis model and determining the voice style and the speaking role of each text segment from the model's output; the voice timbre corresponding to each text segment is then determined according to the speaking role, for example by querying a role-timbre correspondence table. The text analysis model is a neural network model in the field of deep learning that outputs the voice style and speaking role of a text segment given the segment as input.
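If the style and role are predicted by a pre-trained text analysis model rather than hand-written rules, the call pattern could look like the sketch below. The Hugging Face pipeline API is only one possible way to host such classifiers, and the two checkpoint names and the table contents are placeholders, not models released with this patent.

```python
from transformers import pipeline  # assumed dependency, not named in the patent

# Two hypothetical fine-tuned classifiers: one predicts the voice style of a
# segment, the other predicts the speaking role. The checkpoints are placeholders.
style_classifier = pipeline("text-classification", model="my-org/segment-style")
role_classifier = pipeline("text-classification", model="my-org/segment-role")

ROLE_TIMBRE_TABLE = {"father": "steady_male", "narrator": "narrator"}  # assumed


def analyze_segment(segment: str) -> tuple:
    style = style_classifier(segment)[0]["label"]
    role = role_classifier(segment)[0]["label"]
    # The timbre is then obtained by querying the role-timbre correspondence table.
    timbre = ROLE_TIMBRE_TABLE.get(role, "narrator")
    return style, timbre
```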
In this embodiment, after the step S102 is executed to determine the voice style and the voice tone respectively corresponding to each text segment, the step S103 is executed to respectively convert each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment. The audio clip obtained by the conversion in this embodiment has an emotion type corresponding to the speech style and a tone color corresponding to the character identifier.
That is to say, when the speech synthesis of the text segment is performed, the emotion change and the tone personalization of the synthesized audio segment can be considered at the same time, so that the problem that the synthesized audio segment is monotonous and stereotyped in the prior art is avoided, and the synthesized audio segment is more real and vivid.
In this embodiment, when S103 is executed to convert each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment, the optional implementation manners that can be adopted are as follows: and inputting the text segments, the voice styles and voice timbres corresponding to the text segments into an audio generation model obtained by pre-training, and obtaining the audio segments corresponding to the text segments according to the output result of the audio generation model. The audio generation model related to the embodiment belongs to a neural network model in the field of deep learning, and can generate an audio segment conforming to the voice style and the voice tone of an input text segment according to the input text segment and the voice style and the voice tone corresponding to the text segment.
For example, if step S102 determines that the voice style corresponding to text segment 1 is "angry" and the voice timbre is a steady male voice, step S103 produces an audio segment in which the text content of text segment 1 is spoken angrily in that steady male timbre.
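The single-model variant of S103 boils down to a callable that maps (text segment, voice style, voice timbre) to a waveform. The class below is only an interface stub with silent placeholder output, to make the example from the previous paragraph concrete; it is not the patent's model.

```python
import numpy as np


class AudioGenerationModel:
    """Interface stub for the pre-trained audio generation model: it takes a
    text segment plus its voice style and voice timbre and returns a waveform.
    A real implementation would load and run a neural network here."""

    def __init__(self, sample_rate: int = 24000):
        self.sample_rate = sample_rate

    def __call__(self, text: str, style: str, timbre: str) -> np.ndarray:
        return np.zeros(self.sample_rate, dtype=np.float32)  # placeholder audio


# Mirroring the example above: text segment 1, "angry" style, steady male timbre.
model = AudioGenerationModel()
clip = model("You broke it again!", style="angry", timbre="steady_male")
```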
In this embodiment, after S103 is executed to convert each text segment into an audio segment, S104 is executed to splice the converted audio segments, so as to obtain a speech synthesis result corresponding to the text to be processed.
Specifically, when S104 is executed to splice the converted audio segments into a speech synthesis result corresponding to the text to be processed, an optional implementation is: splicing the audio segments sequentially according to the order in which their corresponding text segments appear in the text to be processed, and taking the splicing result as the speech synthesis result corresponding to the text to be processed.
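Provided the audio segments share one sample rate, the splicing of S104 reduces to ordered concatenation, as sketched below; the explicit order argument is an assumption added only to emphasize that clips follow the order of their source text segments.

```python
from typing import List, Optional, Sequence

import numpy as np


def splice(clips: List[np.ndarray],
           order: Optional[Sequence[int]] = None) -> np.ndarray:
    # Concatenate the audio clips in the order of their source text segments;
    # the result is the speech synthesis output for the whole text.
    if order is not None:
        clips = [clips[i] for i in order]
    return np.concatenate(clips)
```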
According to the technical scheme provided by the embodiment, the voice styles and voice timbres corresponding to different text segments in the text to be processed are determined, and then different text segments are converted by using different voice styles and voice timbres, so that the audio segments of the text segments have different voice styles and voice timbres, the pronunciation diversity of different roles in the voice synthesis result is improved, the style expressive force of the voice synthesis result is enhanced, and the voice synthesis result is more real and vivid.
Fig. 2 is a schematic diagram according to a second embodiment of the present application. As shown in fig. 2, when executing S103 to convert each text segment into an audio segment according to the voice style and voice timbre corresponding to each text segment, the embodiment may specifically include the following steps:
s201, obtaining style codes corresponding to the voice style;
s202, converting the text segment into a first acoustic feature by utilizing the style coding;
s203, converting the first acoustic feature into a second acoustic feature by using the voice timbre;
and S204, converting the second acoustic feature into an audio clip.
That is to say, in the embodiment, for each text segment, the style code and the timbre corresponding to each text segment are respectively acquired to realize conversion from the text segment to the audio segment, so that the audio segment obtained through conversion has respective emotion and timbre, and the audio segment obtained through conversion is more real and vivid.
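A structural sketch of this two-stage pipeline (S201-S204) follows. The two model classes are stubs that only show the expected inputs and outputs (a style-bearing mel spectrogram, then a timbre-converted mel spectrogram); a real system would load pre-trained networks, and the vocoder is passed in as a callable such as the Griffin-Lim helper shown later.

```python
from typing import Callable

import numpy as np


class StyleAcousticModel:
    """Stub for the speech synthesis model of S202: text segment plus style code
    in, first acoustic feature (a mel spectrogram carrying style and content) out."""

    def __call__(self, text: str, style_code: np.ndarray) -> np.ndarray:
        return np.zeros((80, 200), dtype=np.float32)  # placeholder mel frames


class TimbreConversionModel:
    """Stub for the timbre conversion model of S203: it changes only the timbre
    of the mel spectrogram, leaving style and text content untouched."""

    def __call__(self, mel: np.ndarray, timbre: str) -> np.ndarray:
        return mel  # placeholder pass-through


def segment_to_audio(text: str, style_code: np.ndarray, timbre: str,
                     acoustic_model: StyleAcousticModel,
                     timbre_model: TimbreConversionModel,
                     vocoder: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    first = acoustic_model(text, style_code)   # S202: style-bearing mel
    second = timbre_model(first, timbre)       # S203: timbre-converted mel
    return vocoder(second)                     # S204: mel -> waveform
```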
The style code obtained in S201 is specifically a Mel-frequency spectrum (Mel spectrum), which is a speech feature extracted from audio data of different speech styles and is used for reflecting different emotion types in the audio data.
It can be understood that, in order to give the converted audio segments prosody and expressiveness similar to professional dubbing, the audio data from which the mel spectra are extracted may be recorded by professional voice actors, with the recordings covering the styles that occur most frequently in novel or story texts.
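If the style code is literally a (log-)mel spectrum taken from a style reference recording, it can be extracted with standard tooling; the sketch below uses librosa as an assumed dependency, and the sample rate and mel-band count are arbitrary choices.

```python
import librosa
import numpy as np


def extract_style_code(reference_wav: str, sr: int = 24000,
                       n_mels: int = 80) -> np.ndarray:
    # Load a style reference recording (e.g. by a professional voice actor for
    # a given emotion) and return its log-mel spectrogram as the style code.
    audio, sr = librosa.load(reference_wav, sr=sr)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6)
```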
Generally, one text segment corresponds to one voice style, but a text segment may also correspond to multiple voice styles, and multiple voice styles correspond to multiple style codes; for example, a segment may correspond to both an "angry" and a "sad" style code, and the prior art cannot convert a text segment using multiple style codes at the same time. In addition, the style code corresponding to a voice style may be unavailable, for example when there is no style code for a "calm" voice style. Both situations prevent the text segment from being converted into an audio segment, thereby reducing the success rate of speech synthesis.
In order to further improve the success rate of speech synthesis, in this embodiment, when the style code corresponding to the speech style is obtained in S201, the optional implementation manner that may be adopted is: determining candidate style codes corresponding to the voice styles; and interpolating the determined candidate style codes, and taking the interpolation result as the style code corresponding to the voice style. The embodiment can realize the mixing of different style codes or the change of the strength of a certain style code by interpolating the candidate style codes.
When determining the candidate style codes corresponding to a voice style in S201, this embodiment may take the multiple style codes corresponding to that voice style as candidates, for example the "angry" and "sad" style codes of a segment whose voice style combines both emotions; alternatively, several style codes whose styles are highly similar to the target voice style may be used as candidates, for example style codes of styles close to the "calm" voice style.
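Interpolating candidate style codes can be sketched as a weighted average, assuming the candidates have been brought to a common shape (e.g. fixed-length style representations or time-aligned mel spectra); equal weights mix two styles, while weighting a single code against a neutral code strengthens or weakens that style. The function below is a minimal sketch under that assumption.

```python
from typing import List, Optional

import numpy as np


def interpolate_style_codes(candidates: List[np.ndarray],
                            weights: Optional[List[float]] = None) -> np.ndarray:
    # Weighted linear interpolation of candidate style codes of identical shape.
    codes = np.stack(candidates)
    if weights is None:
        weights = [1.0 / len(candidates)] * len(candidates)
    w = np.asarray(weights, dtype=codes.dtype)
    w = w.reshape((-1,) + (1,) * (codes.ndim - 1))
    return (w * codes).sum(axis=0)
```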
In this embodiment, the first acoustic feature obtained in S202 and the second acoustic feature obtained in S203 are specifically mel spectra, where the first acoustic feature carries the specific style and the text content of the text segment, and the second acoustic feature additionally carries the specific timbre.
In the embodiment, when performing S202 to convert the text segment into the first acoustic feature by using style coding, the optional implementation manners that can be adopted are: and inputting the text segment and the style code into a pre-trained speech synthesis model, and taking an output result of the speech synthesis model as a first acoustic feature. The speech synthesis model in this embodiment can output acoustic features of text content including a specific style and a text segment according to the input text segment and the style code.
In the present embodiment, when performing S203 to convert the first acoustic feature into the second acoustic feature by using the voice timbre, the following optional implementation manners may be adopted: and inputting the voice tone and the first acoustic feature into a tone conversion model obtained by pre-training, and taking an output result of the tone conversion model as a second acoustic feature. The tone conversion model in this embodiment can output the acoustic features including the specific tone according to the input acoustic features and the voice tone, and only change the tone in the acoustic features, but not change the text content of the specific style and text segment included in the acoustic features. The speech synthesis model and the tone conversion model of the embodiment belong to a neural network model in the deep learning field.
S204 converts the second acoustic feature into an audio segment, which may be implemented with an existing vocoder, for example a WORLD-, STRAIGHT- or Griffin-Lim-based vocoder.
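As a dependency-light stand-in for the vocoder step, librosa's Griffin-Lim based mel inversion can turn the second acoustic feature back into a waveform; WORLD-, STRAIGHT- or neural vocoders would normally give better quality. The log/exp convention here matches the extraction sketch above and is an assumption.

```python
import librosa
import numpy as np


def griffin_lim_vocoder(log_mel: np.ndarray, sr: int = 24000) -> np.ndarray:
    # Invert a log-mel spectrogram to audio via librosa's Griffin-Lim based
    # mel inversion (S204).
    mel = np.exp(log_mel) - 1e-6
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)
```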
Fig. 3 is a schematic diagram according to a third embodiment of the present application. As shown in fig. 3, the speech synthesis apparatus of the present embodiment includes:
the acquiring unit 301 is used for acquiring a text to be processed;
the determining unit 302 is configured to determine a plurality of text segments included in the text to be processed, and a voice style and a voice tone respectively corresponding to each text segment;
the synthesis unit 303 is configured to convert each text segment into an audio segment according to the voice style and voice tone corresponding to each text segment;
the processing unit 304 is configured to splice the converted audio segments to obtain a speech synthesis result corresponding to the text to be processed.
The text to be processed acquired by the acquiring unit 301 in this embodiment may be a text containing multi-role dialogue and narration, such as a novel or a story text.
In the present embodiment, after the acquisition unit 301 acquires the text to be processed, the determination unit 302 determines a plurality of text segments included in the text to be processed, and a voice style and a voice tone respectively corresponding to each text segment, that is, the determination unit 302 determines that different text segments have different voice styles and different voice tones.
In this embodiment, when determining the plurality of text segments contained in the text to be processed, the determining unit 302 may split the text to be processed in units of sentences, so that each split sentence serves as one of the text segments, and each resulting text segment corresponds to a character's line in the text or to narration in the text.
The voice style determined by the determining unit 302 in this embodiment, which is used to represent the emotion type of the text segment, may include types such as neutral (ordinary), happy, angry, sad, fearful, confused, respectful, sarcastic, and so on; the voice timbre determined by the determining unit 302 is used to represent the speaking role of the text segment, such as the narrator (voice-over) or a specific character.
It is understood that the voice timbre determined by the determining unit 302 may correspond to a specific person in the text to be processed, for example the name of a character in a novel; it may also correspond to the occupation or identity of a specific person in the text, such as teacher, doctor, father, or mother.
Specifically, in this embodiment, when determining the speech style corresponding to each text segment, the determining unit 302 may adopt an optional implementation manner as follows: extracting emotional words in each text segment; and determining the emotion type corresponding to the extracted emotion words as the voice style corresponding to each text segment.
In this embodiment, the voice style determined by the determining unit 302 is the emotion to be embodied by the audio segment in the voice synthesis result, and by determining the voice style of the text segment, the generated audio segment can more accurately embody the emotion change of the character, so that the vividness of the voice synthesis result is improved.
It can be understood that one emotion word corresponds to a unique emotion type, and one emotion type may correspond to a plurality of emotion words, so that each text segment can obtain a corresponding voice style under the condition that voice style redundancy is avoided as much as possible.
In addition, if the determining unit 302 cannot extract an emotion word from a text segment, the determining unit 302 sets the voice style of that text segment to a default voice style.
Specifically, when determining the voice timbre corresponding to each text segment, the determining unit 302 in this embodiment may adopt an optional implementation manner as follows: extracting role identification in each text segment; and determining the voice timbre corresponding to each text segment according to the extracted role identification.
When determining the voice timbre corresponding to each text segment according to the extracted role identifier, the determining unit 302 may obtain the voice timbre according to a preset role-timbre correspondence table.
In addition, when the determining unit 302 in this embodiment determines the voice timbre corresponding to each text segment according to the extracted character identifier, the optional implementation manner that can be adopted is as follows: acquiring audio data corresponding to the extracted character identifier, which is input by a user; and extracting the tone in the acquired audio data as the voice tone corresponding to each text segment.
In addition, if the determination unit 302 of this embodiment is not able to extract the character identifier from the text segment, the determination unit 302 sets the voice tone of the text segment as a default voice tone.
In order to further improve the efficiency of speech synthesis, when determining the speech style and the speech timbre respectively corresponding to each text segment, the determining unit 302 in this embodiment may adopt an optional implementation manner as follows: inputting each text segment into a text analysis model obtained by pre-training, and determining a voice style and a voice role respectively corresponding to each text segment according to an output result of the text analysis model; and determining the voice tone corresponding to each text segment according to the voice role. The text analysis model according to this embodiment can output the speech style and the speech role of the text segment according to the input text segment.
In this embodiment, after the determining unit 302 determines the voice style and voice tone respectively corresponding to each text segment, the synthesizing unit 303 converts each text segment into an audio segment according to the voice style and voice tone corresponding to each text segment. The audio segment converted by the synthesis unit 303 has an emotion type corresponding to the speech style and a tone corresponding to the character identifier.
That is to say, when the speech synthesis of the text segment is performed, the emotion change and the tone personalization of the synthesized audio segment can be considered at the same time, so that the problem that the synthesized audio segment is monotonous and stereotyped in the prior art is avoided, and the synthesized audio segment is more real and vivid.
In this embodiment, when the synthesizing unit 303 converts each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment, the optional implementation manners that can be adopted are: and inputting the text segments, the voice styles and voice timbres corresponding to the text segments into an audio generation model obtained by pre-training, and obtaining the audio segments corresponding to the text segments according to the output result of the audio generation model. The audio generation model according to this embodiment can generate an audio clip that matches the voice style and voice tone of an input text segment according to the input text segment, and the voice style and voice tone corresponding to the text segment.
In addition, when the synthesizing unit 303 in this embodiment converts each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment, the optional implementation manners that can be adopted are as follows: acquiring style codes corresponding to the voice styles; converting the text segment into a first acoustic feature using style coding; converting the first acoustic feature into a second acoustic feature using the voice timbre; the second acoustic feature is converted into an audio clip.
That is to say, the synthesizing unit 303 realizes conversion from the text segment to the audio segment by respectively obtaining the style code and the timbre corresponding to each text segment, and ensures that the converted audio segment has respective emotion and timbre, so that the converted audio segment is more real and vivid.
The style code acquired by the synthesizing unit 303 in this embodiment is specifically a Mel spectrum (Mel spectrum), which is a speech feature extracted from audio data of different speech styles and is used for reflecting different emotion types in the audio data.
In order to further improve the success rate of speech synthesis, when the synthesis unit 303 in this embodiment obtains the style code corresponding to the speech style, the optional implementation manner that may be adopted is: determining candidate style codes corresponding to the voice styles; and interpolating the determined candidate style codes, and taking the interpolation result as the style code corresponding to the voice style. The synthesis unit 303 can realize mixing between different style codes or change of the strength of a certain style code by interpolating the candidate style codes.
When determining candidate style codes corresponding to the voice style, the synthesizing unit 303 in this embodiment may use a plurality of style codes corresponding to the voice style as the candidate style codes; a plurality of style codes having a high similarity to the speech style may be used as candidate style codes.
The first acoustic feature and the second acoustic feature obtained by the synthesis unit 303 in this embodiment are specifically mel spectra, where the first acoustic feature carries the specific style and the text content of the text segment, and the second acoustic feature additionally carries the specific timbre.
When the synthesizing unit 303 in this embodiment converts the text segment into the first acoustic feature by using the style coding, the optional implementation manners that may be adopted are: and inputting the text segment and the style code into a pre-trained speech synthesis model, and taking an output result of the speech synthesis model as a first acoustic feature. The speech synthesis model in this embodiment can output acoustic features of text content including a specific style and a text segment according to the input text segment and the style code.
When the synthesis unit 303 in this embodiment converts the first acoustic feature into the second acoustic feature by using the voice timbre, the following optional implementation manners may be adopted: and inputting the voice tone and the first acoustic feature into a tone conversion model obtained by pre-training, and taking an output result of the tone conversion model as a second acoustic feature. The tone conversion model in this embodiment can output the acoustic features including the specific tone according to the input acoustic features and tone, and only change the tone in the acoustic features, but not change the text content of the specific style and text segment included in the acoustic features.
The synthesizing unit 303 in this embodiment may use an existing vocoder, for example a WORLD-, STRAIGHT- or Griffin-Lim-based vocoder, to convert the second acoustic feature into an audio segment.
In this embodiment, after the synthesizing unit 303 converts each text segment into an audio segment, the processing unit 304 splices the converted audio segments, so as to obtain a speech synthesis result corresponding to the text to be processed.
Specifically, when the processing unit 304 in this embodiment concatenates the converted audio segments to obtain a speech synthesis result corresponding to the text to be processed, the optional implementation manner that can be adopted is as follows: sequentially splicing the audio clips according to the sequence of the text clips corresponding to the audio clips in the text to be processed; and taking the splicing result of each audio clip as a voice synthesis result corresponding to the text to be processed.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to the speech synthesis method of the embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic apparatus includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the speech synthesis methods provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the speech synthesis method provided by the present application.
The memory 402, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech synthesis method in the embodiment of the present application (for example, the acquisition unit 301, the determination unit 302, the synthesis unit 303, and the processing unit 304 shown in fig. 3). The processor 401 executes the various functional applications and data processing of the server, i.e., implements the speech synthesis method in the above-described method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 402.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, and these remote memories may be connected to the speech synthesis method electronics over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the speech synthesis method may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus of the speech synthesis method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input devices. The output devices 404 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and VPS (Virtual Private Server) services.
According to the technical scheme of the embodiment of the application, the text segments are respectively converted into the audio segments according to the voice styles and voice timbres corresponding to the different text segments in the text to be processed, and then the audio segments are spliced to obtain the voice synthesis result, so that the pronunciation diversity of different roles in the voice synthesis result is improved, the style expressive force of the voice synthesis result is enhanced, and the voice synthesis result is more real and vivid.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and the present invention is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A method of speech synthesis comprising:
acquiring a text to be processed;
determining a plurality of text segments contained in the text to be processed, and a voice style and a voice tone respectively corresponding to each text segment;
respectively converting each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment;
and splicing the audio segments obtained by conversion to obtain a voice synthesis result corresponding to the text to be processed.
2. The method of claim 1, wherein determining a speech style for each text segment comprises:
extracting emotional words in each text segment;
and determining the emotion type corresponding to the emotion words as the voice style corresponding to each text segment.
3. The method of claim 1, wherein determining the voice timbre corresponding to each text segment comprises:
extracting role identification in each text segment;
and determining the voice tone corresponding to each text segment according to the role identification.
4. The method of claim 3, wherein the determining the voice timbre corresponding to each text segment according to the role identifier comprises:
acquiring audio data corresponding to the character identifier, which is input by a user;
and extracting the tone in the audio data as the voice tone corresponding to each text segment.
5. The method of claim 1, wherein the converting each text segment into an audio segment according to the voice style and voice timbre corresponding to each text segment comprises:
acquiring style codes corresponding to the voice styles aiming at each text segment;
converting the text segment into a first acoustic feature using the style code;
converting the first acoustic feature into a second acoustic feature using the voice timbre;
converting the second acoustic feature into an audio clip.
6. The method of claim 5, wherein said obtaining a style code corresponding to the speech style comprises:
determining candidate style codes corresponding to the voice style;
and interpolating the candidate style codes, and taking the interpolation result as the style code corresponding to the voice style.
7. The method of claim 5, wherein said converting text segments into first acoustic features using said style encoding comprises:
inputting the text segment and the style code into a pre-trained speech synthesis model, and taking an output result of the speech synthesis model as the first acoustic feature.
8. The method of claim 5, wherein the transforming the first acoustic feature into a second acoustic feature using the voice timbre comprises:
and inputting the first acoustic feature and the voice tone into a tone conversion model obtained by pre-training, and taking an output result of the tone conversion model as the second acoustic feature.
9. The method of claim 1, wherein the splicing the converted audio segments to obtain the speech synthesis result corresponding to the text to be processed comprises:
sequentially splicing the audio clips according to the sequence of the text clips corresponding to the audio clips in the text to be processed;
and taking the splicing result of each audio clip as a voice synthesis result corresponding to the text to be processed.
10. A speech synthesis apparatus comprising:
the acquisition unit is used for acquiring a text to be processed;
the determining unit is used for determining a plurality of text segments contained in the text to be processed, and the voice style and the voice tone of each text segment respectively;
the synthesis unit is used for respectively converting each text segment into an audio segment according to the voice style and the voice tone corresponding to each text segment;
and the processing unit is used for splicing the audio segments obtained by conversion to obtain a speech synthesis result corresponding to the text to be processed.
11. The apparatus according to claim 10, wherein the determining unit, when determining the speech style corresponding to each text segment, specifically performs:
extracting emotional words in each text segment;
and determining the emotion type corresponding to the emotion words as the voice style corresponding to each text segment.
12. The apparatus according to claim 10, wherein the determining unit, when determining the tone of the speech corresponding to each text segment, specifically performs:
extracting role identification in each text segment;
and determining the voice tone corresponding to each text segment according to the role identification.
13. The apparatus according to claim 12, wherein the determining unit, when determining the tone of the speech corresponding to each text segment according to the role identifier, specifically performs:
acquiring audio data corresponding to the character identifier, which is input by a user;
and extracting the tone in the audio data as the voice tone corresponding to each text segment.
14. The apparatus according to claim 10, wherein the synthesizing unit, when converting each text segment into an audio segment according to the voice style and voice timbre corresponding to each text segment, specifically performs:
acquiring style codes corresponding to the voice styles aiming at each text segment;
converting the text segment into a first acoustic feature using the style code;
converting the first acoustic feature into a second acoustic feature using the voice timbre;
converting the second acoustic feature into an audio clip.
15. The apparatus according to claim 14, wherein the synthesis unit, when obtaining the style code corresponding to the speech style, specifically performs:
determining candidate style codes corresponding to the voice style;
and interpolating the candidate style codes, and taking the interpolation result as the style code corresponding to the voice style.
16. The apparatus according to claim 14, wherein the synthesis unit, when converting the text segment into the first acoustic feature using the style coding, specifically performs:
inputting the text segment and the style code into a pre-trained speech synthesis model, and taking an output result of the speech synthesis model as the first acoustic feature.
17. The apparatus according to claim 14, wherein the synthesis unit, when converting the first acoustic feature into a second acoustic feature using the voice timbre, specifically performs:
and inputting the first acoustic feature and the voice tone into a tone conversion model obtained by pre-training, and taking an output result of the tone conversion model as the second acoustic feature.
18. The apparatus according to claim 10, wherein when the processing unit concatenates the converted audio segments to obtain the speech synthesis result corresponding to the text to be processed, the processing unit specifically performs:
sequentially splicing the audio clips according to the sequence of the text clips corresponding to the audio clips in the text to be processed;
and taking the splicing result of each audio clip as a voice synthesis result corresponding to the text to be processed.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202011169694.3A 2020-10-28 2020-10-28 Voice synthesis method and device, electronic equipment and readable storage medium Pending CN112270920A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169694.3A CN112270920A (en) 2020-10-28 2020-10-28 Voice synthesis method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011169694.3A CN112270920A (en) 2020-10-28 2020-10-28 Voice synthesis method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112270920A true CN112270920A (en) 2021-01-26

Family

ID=74345177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169694.3A Pending CN112270920A (en) 2020-10-28 2020-10-28 Voice synthesis method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112270920A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101176146A (en) * 2005-05-18 2008-05-07 松下电器产业株式会社 Speech synthesizer
CN109979430A (en) * 2017-12-28 2019-07-05 深圳市优必选科技有限公司 A kind of method, apparatus that robot tells a story, robot and storage medium
CN108847215A (en) * 2018-08-29 2018-11-20 北京云知声信息技术有限公司 The method and device of speech synthesis is carried out based on user's tone color
CN111128118A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Speech synthesis method, related device and readable storage medium
CN111402842A (en) * 2020-03-20 2020-07-10 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANG Jianwu et al.: "Research Frontiers in Auditory Information Processing" (《听觉信息处理研究前沿》), Shanghai Jiao Tong University Press, page 372 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112954453B (en) * 2021-02-07 2023-04-28 北京有竹居网络技术有限公司 Video dubbing method and device, storage medium and electronic equipment
CN112905838A (en) * 2021-02-07 2021-06-04 北京有竹居网络技术有限公司 Information retrieval method and device, storage medium and electronic equipment
CN112929746A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN112954453A (en) * 2021-02-07 2021-06-11 北京有竹居网络技术有限公司 Video dubbing method and apparatus, storage medium, and electronic device
CN112929746B (en) * 2021-02-07 2023-06-16 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN112786009A (en) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113096625A (en) * 2021-03-24 2021-07-09 平安科技(深圳)有限公司 Multi-person Buddha music generation method, device, equipment and storage medium
CN113160791A (en) * 2021-05-07 2021-07-23 京东数字科技控股股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113903325A (en) * 2021-05-31 2022-01-07 荣耀终端有限公司 Method and device for converting text into 3D audio
CN113409765A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113628629A (en) * 2021-07-29 2021-11-09 深圳华维教育科技有限公司 Memory module with tone re-engraving function and re-engraving method thereof
CN113851106A (en) * 2021-08-17 2021-12-28 北京百度网讯科技有限公司 Audio playing method and device, electronic equipment and readable storage medium
CN113850083A (en) * 2021-08-17 2021-12-28 北京百度网讯科技有限公司 Method, device and equipment for determining broadcast style and computer storage medium
CN113851106B (en) * 2021-08-17 2023-01-06 北京百度网讯科技有限公司 Audio playing method and device, electronic equipment and readable storage medium
WO2023071166A1 (en) * 2021-10-25 2023-05-04 网易(杭州)网络有限公司 Data processing method and apparatus, and storage medium and electronic apparatus

Similar Documents

Publication Publication Date Title
CN112270920A (en) Voice synthesis method and device, electronic equipment and readable storage medium
CN112365880B (en) Speech synthesis method, device, electronic equipment and storage medium
KR102484967B1 (en) Voice conversion method, electronic device, and storage medium
CN111667816B (en) Model training method, speech synthesis method, device, equipment and storage medium
US10614803B2 (en) Wake-on-voice method, terminal and storage medium
JP6633153B2 (en) Method and apparatus for extracting information
CN110473516B (en) Voice synthesis method and device and electronic equipment
CN111754978B (en) Prosodic hierarchy labeling method, device, equipment and storage medium
CN112509552B (en) Speech synthesis method, device, electronic equipment and storage medium
KR20210124104A (en) Methods and apparatuses for synthesizing voice and training the corresponding model, electronic equipment, storage medium, and computer program product
US20210082394A1 (en) Method, apparatus, device and computer storage medium for generating speech packet
CN112131988A (en) Method, device, equipment and computer storage medium for determining virtual character lip shape
CN110797005B (en) Prosody prediction method, apparatus, device, and medium
CN112542155B (en) Song synthesis method, model training method, device, equipment and storage medium
CN112365877A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
TW202006532A (en) Broadcast voice determination method, device and apparatus
CN112365879A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112908292B (en) Text voice synthesis method and device, electronic equipment and storage medium
JP2022133392A (en) Speech synthesis method and device, electronic apparatus, and storage medium
CN112382287A (en) Voice interaction method and device, electronic equipment and storage medium
CN110767212B (en) Voice processing method and device and electronic equipment
CN106873798B (en) Method and apparatus for outputting information
CN112289305A (en) Prosody prediction method, device, equipment and storage medium
CN112309368A (en) Prosody prediction method, device, equipment and storage medium
CN112382269A (en) Audio synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination