CN111415650A - Text-to-speech method, device, equipment and storage medium


Info

Publication number
CN111415650A
Authority
CN
China
Prior art keywords
text
segment
voice audio
text segment
audio
Prior art date
Legal status
Pending
Application number
CN202010220555.2A
Other languages
Chinese (zh)
Inventor
罗忠岚
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN202010220555.2A
Publication of CN111415650A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 - Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a text-to-speech conversion method, apparatus, device, and storage medium, and belongs to the technical field of computers. The method comprises the following steps: determining at least one text segment included in a target text; acquiring context information corresponding to each text segment, and determining at least one type of human voice audio attribute information corresponding to each text segment based on that context information; converting each text segment into a human voice audio segment based on the at least one type of human voice audio attribute information corresponding to it; and synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the arrangement order of the text segments in the target text. The human voice audio converted by the method and the apparatus changes correspondingly as the context changes, so the conversion to human voice audio is more flexible.

Description

Text-to-speech method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for text-to-speech conversion.
Background
When a user wants to browse an article, the text in the article can be converted into sound, so that the user does not need to read the article visually.
In the course of implementing the present application, the inventors found that the related art has at least the following problems:
in the above process, regardless of the type of article the user browses or which part of the article is being read, the article is read aloud with the system's default voice; that is, whole articles of different types are all read in the same voice, so the presentation is monotonous and the flexibility is poor.
Disclosure of Invention
In order to solve the technical problems in the related art, embodiments of the present application provide a text-to-speech conversion method, apparatus, device, and storage medium. The technical solution is as follows:
in a first aspect, a method for text-to-speech conversion is provided, the method including:
determining at least one text segment included in the target text;
acquiring context information corresponding to each text segment, and determining at least one type of human voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment;
respectively converting each text segment into a voice audio segment based on at least one type of voice audio attribute information corresponding to each text segment;
and synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the arrangement order of the text segments in the target text.
Optionally, the determining at least one text segment included in the target text includes:
if the target text comprises a text part of a conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment;
and if the target text comprises a text part of a non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
Optionally, the at least one type of human voice audio attribute information includes:
at least one of a voice audio type, a speech rate, a tone, and a volume.
Optionally, after converting each text segment into a human voice audio segment, the method further includes:
obtaining music audio corresponding to each text segment based on the context information corresponding to each text segment and the corresponding relation between the pre-stored context information and the music audio;
synthesizing the music audio corresponding to each text segment with the voice audio segment corresponding to each text segment to obtain the voice audio segment which is added with the music audio and corresponds to each text segment;
the synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text includes:
and synthesizing the human voice audio, with music audio added, corresponding to the target text from the music-added human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text.
Optionally, the synthesizing the music audio segment corresponding to each text segment with the human voice audio segment corresponding to each text segment to obtain the human voice audio segment to which the music audio is added corresponding to each text segment includes:
for each text segment, determining the voice duration corresponding to the voice audio segment corresponding to the text segment, and determining a music audio segment with the duration equal to the voice duration based on the music audio corresponding to the text segment;
and superposing and synthesizing the music audio segments corresponding to the text segments and the voice audio segments to obtain the voice audio segments which are added with the music audio and correspond to the text segments.
In a second aspect, an apparatus for text-to-speech conversion is provided, the apparatus comprising:
a first determination module configured to determine at least one text segment included in the target text;
the second determining module is configured to acquire the context information corresponding to each text segment, and determine at least one type of human voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment;
the conversion module is configured to convert each text segment into a human voice audio segment respectively based on at least one human voice audio attribute information corresponding to each text segment;
and the synthesis module is configured to synthesize the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text.
Optionally, the first determining module is configured to:
if the target text comprises a text part of a conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment;
and if the target text comprises a text part of a non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
Optionally, the at least one type of human voice audio attribute information includes:
at least one of a voice audio type, a speech rate, a tone, and a volume.
Optionally, the apparatus further includes an adding module configured to:
obtaining music audio corresponding to each text segment based on the context information corresponding to each text segment and the corresponding relation between the pre-stored context information and the music audio;
synthesizing the music audio corresponding to each text segment with the voice audio segment corresponding to each text segment to obtain the voice audio segment which is added with the music audio and corresponds to each text segment;
the synthesis module configured to:
and synthesizing the human voice audio, with music audio added, corresponding to the target text from the music-added human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text.
Optionally, the adding module is configured to:
for each text segment, determining the voice duration corresponding to the voice audio segment corresponding to the text segment, and determining a music audio segment with the duration equal to the voice duration based on the music audio corresponding to the text segment;
and superposing and synthesizing the music audio segments corresponding to the text segments and the voice audio segments to obtain the voice audio segments which are added with the music audio and correspond to the text segments.
In a third aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored, and the instruction is loaded and executed by the processor to implement the operations performed by the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the operations performed by the method of the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the target text is divided into a plurality of text segments, the context information corresponding to each text segment is determined, at least one type of voice audio attribute corresponding to each text segment is further determined, each text segment is converted into a voice audio segment, each voice audio segment is synthesized, the voice audio corresponding to the target text is obtained, the converted voice audio can be correspondingly changed based on the change of the context, and the audio conversion is more flexible.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flow chart of a method for text-to-speech conversion provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of text-to-speech conversion provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for text-to-speech conversion according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text-to-speech conversion method can be implemented by a terminal, by a server, or by the terminal and the server together. The terminal may have components such as a camera, a microphone, and an earphone, has a communication function, and can access the Internet; the terminal may be a mobile phone, a tablet computer, a smart wearable device, a desktop computer, a notebook computer, or the like. The server may be a background server of the application program, and the server can communicate with the terminal. The server may be a single server or a server cluster. If it is a single server, it may be responsible for all of the processing in the following scheme; if it is a server cluster, different servers in the cluster may be responsible for different parts of the processing, and the specific allocation of the processing can be set arbitrarily by a technician according to actual needs, which is not described herein again.
The method provided by the embodiments of the application can be applied to an audiobook (book-listening) application program; in particular, it can be used to convert text that the user wants to read into human voice audio, so that the user can learn the content of the text by listening to the human voice audio.
Fig. 1 is a flowchart of a method for text-to-speech conversion according to an embodiment of the present application. Referring to fig. 1, the embodiment includes:
step 101, determining at least one text segment included in the target text.
The target text may be an article such as a poem, prose, or a novel, or may be just one passage of text, for example a dialog or a scene description. A text segment is the minimum unit of text-to-speech conversion, and a text segment may be a chapter, a paragraph, or a sentence.
In an application scenario, when the user opens the audiobook application, the user selects an electronic book or an article to read, and the terminal displays the current reading interface. The user clicks a read-aloud button on the current reading interface, or long-presses the current reading interface, to trigger a read-aloud instruction. After receiving the read-aloud instruction, the terminal starts to read aloud the text displayed on the current interface. After the text displayed on the current interface has been read aloud, the interface is refreshed to display the text that follows. While the text displayed on the current interface is being read, the text currently being read aloud is highlighted, so that the user clearly knows which text is being read.
Optionally, the text part of the dialog type and the text part of the non-dialog type in the target text are divided in different manners, and the division rule is as follows: if the target text comprises a text part of the conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment; and if the target text comprises the text part of the non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
It should be noted that, for a dialog-type text portion, each dialog character has its own corresponding voice, and the user can distinguish the dialog characters according to the voices corresponding to them. A non-dialog-type text portion is usually a scenery description or narration between events, and such a portion may correspond to a single voice.
Further, the method of determining the text part of the dialog type and the text part of the non-dialog type may be the following two methods.
First, a text portion of a dialog type and a text portion of a non-dialog type may be determined based on keywords and key symbols in the target text.
Whether each single sentence in the target text contains a keyword or a key symbol is detected; single sentences that contain a keyword or a key symbol are classified as dialog-type text portions, and single sentences that contain neither are classified as non-dialog-type text portions. Alternatively, consecutive single sentences that contain keywords or key symbols are merged into one dialog-type text portion, and consecutive single sentences that contain neither are merged into one non-dialog-type text portion.
The keywords may be words such as "said" or "shouted", and the key symbols may be double quotation marks or colons. A single sentence can be delimited by any punctuation mark; for example, "the earth gradually awakens, tender grass drills out of the soil, and a hazy green spreads through the meadow" consists of 3 single sentences.
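Purely for illustration, the following is a minimal Python sketch of this first division method; the keyword list, key symbols, sentence-splitting punctuation, and function names are assumptions made for the example and are not specified by the application.

```python
import re

# Illustrative keyword/key-symbol sets; the actual sets are preset by a technician.
DIALOG_KEYWORDS = ["said", "shouted", "asked", "replied"]
DIALOG_SYMBOLS = ['"', ':']

def split_single_sentences(text):
    """Split text into single sentences at commas and sentence-final punctuation."""
    return [s.strip() for s in re.split(r"[,.;!?，。；！？]", text) if s.strip()]

def classify_sentence(sentence):
    """Label a single sentence as dialog or non-dialog using keywords and key symbols."""
    has_keyword = any(k in sentence for k in DIALOG_KEYWORDS)
    has_symbol = any(s in sentence for s in DIALOG_SYMBOLS)
    return "dialog" if (has_keyword or has_symbol) else "non-dialog"

def split_into_typed_parts(text):
    """Merge consecutive single sentences of the same type into one text portion."""
    parts = []
    for sentence in split_single_sentences(text):
        label = classify_sentence(sentence)
        if parts and parts[-1][0] == label:
            parts[-1][1].append(sentence)
        else:
            parts.append((label, [sentence]))
    return [(label, ", ".join(sentences)) for label, sentences in parts]
```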
Second, the target text is divided according to a technician's understanding of the target text: when the technician considers a certain text portion to be of the dialog type, that text portion is determined to be of the dialog type; when the technician considers a text portion to be of the non-dialog type, that text portion is determined to be of the non-dialog type.
Since the first division method may have errors, the technician may detect the errors to obtain correctly divided text portions of conversation type and text portions of non-conversation type.
For a text portion that is not of a dialog type, it may be determined whether a chapter exists for the text portion by detecting whether a title exists for the text portion. Whether a paragraph exists in a text portion may be determined by detecting whether a line break exists in the text portion.
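As an illustration of the chapter/paragraph division of non-dialog text described above, a short sketch follows; the title heuristic (a short line without ending punctuation) and the use of line breaks as paragraph boundaries are assumptions for the example.

```python
def split_non_dialog_part(text_part):
    """Divide a non-dialog text portion into chapter-title and paragraph segments."""
    segments = []
    for line in text_part.split("\n"):   # a line break marks a paragraph boundary
        line = line.strip()
        if not line:
            continue
        # A short line without ending punctuation is treated as a chapter title.
        is_title = len(line) < 30 and not line.endswith((".", "!", "?", "。", "！", "？"))
        segments.append({"type": "chapter_title" if is_title else "paragraph",
                         "text": line})
    return segments
```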
Step 102, acquiring context information corresponding to each text segment, and determining at least one type of voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment.
The context information corresponding to the text part of the dialog type may include information such as gender, age, and identity of the dialog character, and may also include an atmosphere in which the dialog character is located, such as joy, sadness, and the like. The context information corresponding to the text portion of the non-dialog type may include the atmosphere described by the text, such as cheerful, sadness, etc., and may also include the environment, weather, etc.
The human voice audio attribute information includes at least one of a human voice audio type, a speech rate, an intonation, and a volume. Generally, when a dialog character's mood is low, the speech rate is slower, the intonation is lower, and the volume is lower; when the emotion fluctuates strongly, the speech rate is faster, the intonation is higher, and the volume is higher. For example, when a dialog character is happy, the speech rate, intonation, and volume are relatively high, and when the dialog character is angry, the speech rate, intonation, and volume are at their highest. The human voice audio type corresponds to the timbre of the voice, the speech rate to how fast the voice is, the intonation to how sharp the voice is, and the volume to how loud the voice is. In practice, the speech rate, intonation, and volume of a voice may each correspond to many values, and the text can hardly reflect the specific values that the converted voice should take, so the numerical ranges of the speech rate, intonation, and volume may each be set to 0 to 100 and divided into four stages, with each stage represented by the middle value of the sub-range in which it lies. The speech rates, intonations, and volumes corresponding to the stages are then combined, so that different characters correspond to different combinations of speech rate, intonation, and volume in different contexts.
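As a concrete illustration of this staged 0-100 mapping, consider the following sketch; the stage boundaries and the context-to-stage combinations shown are assumed values for the example, not values fixed by the application.

```python
# Four stages over the 0-100 range; each stage is represented by the midpoint of its sub-range.
STAGES = {
    "lowest":  (0, 25),
    "low":     (25, 50),
    "high":    (50, 75),
    "highest": (75, 100),
}

def stage_midpoint(stage):
    low, high = STAGES[stage]
    return (low + high) / 2.0   # e.g. "lowest" -> 12.5, "highest" -> 87.5

# Illustrative combinations of (speech rate, intonation, volume) stages per context.
CONTEXT_TO_STAGES = {
    "low_mood": ("lowest", "lowest", "lowest"),
    "happy":    ("high", "high", "high"),
    "angry":    ("highest", "highest", "highest"),
}

def attribute_values(context):
    """Return concrete (speech rate, intonation, volume) values for a context."""
    return tuple(stage_midpoint(s) for s in CONTEXT_TO_STAGES[context])
```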
Detecting and acquiring the context information corresponding to each text segment, and determining at least one type of voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment.
Further, take a dialog-type text portion in which a little girl says "Mom, I'm hungry" as an example. The identity of the dialog character is a little girl, and the current context information is a low-mood context, where the low-mood context can be determined from the word "hungry". In this case, the current human voice audio type can be determined to be a loli (young girl) voice according to the identity of the dialog character, and the speech rate, intonation, and volume are determined according to the low-mood context.
For a non-dialog-type text portion, take the text segment "the earth gradually awakens, tender grass drills out of the soil, and a hazy green spreads through the meadow" as an example. It describes a spring scene, which is a rather beautiful environment, so the human voice audio type can be determined to be a sweet female voice, and the speech rate, intonation, and volume are set relatively low.
Optionally, the context information corresponding to a text segment may be determined in the following ways:
in the first mode, the context information corresponding to the text segment is determined according to the context information corresponding to the keywords in the text segment.
The technical personnel preset at least one keyword corresponding to each type of contextual information and establish a keyword set aiming at each type of contextual information. And detecting keywords in the text segments, and determining context information corresponding to each keyword. And determining a plurality of keywords corresponding to each type of context information based on the context information corresponding to each keyword. And determining the occurrence frequency of each keyword corresponding to each type of contextual information, adding the occurrence frequencies of all keywords corresponding to each type of contextual information, and determining the total occurrence frequency of all keywords corresponding to each type of contextual information. And determining the contextual information with the maximum total occurrence times of the corresponding keywords as the contextual information corresponding to the text segment. Or after detecting the keywords in the text segment, determining the occurrence frequency of each keyword in the text segment, adding the occurrence frequencies of each keyword in the text segment, further determining the total occurrence frequency of all the keywords in the text segment, and simultaneously determining the total occurrence frequency of all the keywords corresponding to each type of context information. And screening a ratio which is greater than a preset ratio according to the ratio of the total occurrence frequency of all the keywords corresponding to each piece of context information to the total occurrence frequency of all the keywords in the text segment, determining the maximum ratio in the screened ratios, and taking the context information corresponding to the maximum ratio as the context information corresponding to the text segment.
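A minimal sketch of this first keyword-based mode follows, covering both the largest-total-count variant and the ratio-threshold variant described above; the keyword sets and the threshold value are illustrative assumptions.

```python
from collections import Counter

# Illustrative keyword sets preset per type of context information.
CONTEXT_KEYWORDS = {
    "sad":   ["tear", "cry", "sob"],
    "happy": ["smile", "laugh"],
    "angry": ["roar", "yell"],
}

def determine_context(segment_text, ratio_threshold=None):
    """Pick the context whose keywords occur most often; optionally use the ratio variant."""
    counts = Counter(segment_text.lower().split())
    totals = {ctx: sum(counts[w] for w in kws) for ctx, kws in CONTEXT_KEYWORDS.items()}
    total_in_segment = sum(totals.values())
    if total_in_segment == 0:
        return None                      # no keyword found; fall back to another mode
    if ratio_threshold is None:          # variant 1: largest total occurrence count
        return max(totals, key=totals.get)
    ratios = {ctx: t / total_in_segment for ctx, t in totals.items()}
    screened = {ctx: r for ctx, r in ratios.items() if r > ratio_threshold}
    return max(screened, key=screened.get) if screened else None
```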
In a second mode, according to the information of at least one dimension included in the context information, a keyword corresponding to each dimension is determined, and then the information of at least one dimension of the text segment is determined.
Wherein the context information may comprise information of at least one dimension, such as atmosphere, weather, environment, etc.
For the information of the dimension of the atmosphere in the context information, various atmosphere information can be divided according to the atmosphere dimension, such as joy, sadness, relaxation, tension and the like, at least one keyword corresponding to each atmosphere information is preset, and a keyword set is established for each atmosphere information. And detecting the keywords in a certain text segment according to the keyword set corresponding to all the atmosphere information, and determining the atmosphere information corresponding to each keyword. And determining each keyword corresponding to each atmosphere information based on the atmosphere information corresponding to each keyword. And determining the occurrence frequency of each keyword corresponding to each atmosphere information, adding the occurrence frequencies of all keywords corresponding to each atmosphere information, and determining the total occurrence frequency of all keywords corresponding to each atmosphere information. And determining the atmosphere information with the maximum corresponding total occurrence frequency as the atmosphere information corresponding to the text segment. Or after determining the total occurrence number of all keywords corresponding to each atmosphere information, adding the total occurrence number of all keywords corresponding to each atmosphere information, and further determining the total occurrence number of all keywords in the text segment. And determining the maximum ratio of all ratios according to the ratio of the total occurrence times corresponding to all the keywords corresponding to each atmosphere information to the total occurrence times corresponding to all the keywords in the text segment, and taking the atmosphere information corresponding to the maximum ratio as the atmosphere information corresponding to the text segment.
For example: the atmosphere is divided into three kinds of atmosphere information, namely sadness, happiness and anger. In a certain text segment, the keywords corresponding to sadness are "tear" and "cry"; "tear" occurs 4 times and "cry" occurs 3 times, so the total number of occurrences of all keywords corresponding to sadness is 7. The keyword corresponding to happiness is "smile", which occurs 1 time, so the total number of occurrences of all keywords corresponding to happiness is 1. The keywords corresponding to anger are "loud" and "roar"; "loud" occurs 1 time and "roar" occurs 1 time, so the total number of occurrences of all keywords corresponding to anger is 2. It can be seen that the total number of occurrences of all keywords corresponding to sadness is the largest, so sadness is taken as the atmosphere information corresponding to the text segment.
For information of the dimension of weather in the context information, a plurality of weather information can be divided according to the weather dimension, such as sunny days, rainy days, cloudy days, snowy days and the like, at least one keyword corresponding to each weather information is preset, and a keyword set is established for each weather information. And detecting keywords in a certain text segment according to the keyword set corresponding to all weather information, and determining the weather information corresponding to each keyword. And determining a plurality of keywords corresponding to each type of weather information based on the weather information corresponding to each keyword. And determining the occurrence frequency of each keyword corresponding to each weather information, adding the occurrence frequencies of all the keywords corresponding to each weather information, and determining the total occurrence frequency of all the keywords corresponding to each weather information. And determining the weather information with the maximum total occurrence frequency as the weather information corresponding to the text segment. Or after the total occurrence number of all the keywords corresponding to each type of weather information is determined, adding the total occurrence number of all the keywords corresponding to each type of weather information, and further determining the total occurrence number of all the keywords in the text segment. And determining the maximum ratio of all ratios according to the ratio of the total occurrence times corresponding to all the keywords corresponding to each type of weather information to the total occurrence times corresponding to all the keywords in the text segment, and taking the weather information corresponding to the maximum ratio as the weather information corresponding to the text segment.
For example, the weather is divided into three types of weather information, namely sunny, cloudy and rainy. In a certain text segment, the keywords corresponding to sunny weather are "sun", "dry", "burnt cloud" and "moon"; "sun" occurs 3 times, "dry" occurs 2 times, "burnt cloud" occurs 2 times, and "moon" occurs 4 times, so the total number of occurrences of all keywords corresponding to sunny weather is 11. The keyword corresponding to cloudy days is "cloud pressure", which occurs 1 time, so the total number of occurrences of all keywords corresponding to cloudy days is 1. No keyword corresponding to rainy days is detected, so the total number of occurrences of keywords corresponding to rainy days is 0. It can be seen that the total number of occurrences of all keywords corresponding to sunny weather is far greater than that for cloudy days and greater than that for rainy days, so sunny weather is taken as the weather information corresponding to the text segment.
For the information of the dimension of the environment in the context information, two types of environment information can be divided according to the dimension of the environment, such as a non-natural environment and a natural environment, at least one keyword corresponding to each type of environment information is preset, and a keyword set is established for each type of environment information. And detecting the keywords in a certain text segment according to the keyword set corresponding to all the environment information, and determining the environment information corresponding to each keyword. And determining each keyword corresponding to each environment information based on the environment information corresponding to each keyword. Determining the occurrence frequency of each keyword corresponding to each environmental information, adding the occurrence frequencies of all keywords corresponding to each environmental information, and determining the total occurrence frequency of all keywords corresponding to each environmental information. And determining the environment information corresponding to the text segment with the maximum total occurrence number as the environment information corresponding to the text segment. Or after determining the total occurrence number of all the keywords corresponding to each piece of environment information, adding the total occurrence number of all the keywords corresponding to each piece of environment information, and further determining the total occurrence number of all the keywords in the text segment. And determining the maximum ratio of all ratios according to the ratio of the total occurrence times corresponding to all the keywords corresponding to each piece of environment information to the total occurrence times corresponding to all the keywords in the text segment, and taking the environment information corresponding to the maximum ratio as the environment information corresponding to the text segment.
For example, the environment is divided into two types of environment information, namely a natural environment and a non-natural environment. In a certain text segment, the keywords corresponding to the natural environment are "bird", "peach blossom", "green grass" and "willow"; "bird" occurs 4 times, "peach blossom" occurs 5 times, "green grass" occurs 4 times, and "willow" occurs 3 times, so the total number of occurrences of all keywords corresponding to the natural environment is 16. The keywords corresponding to the non-natural environment are "tall building", "road" and "vehicle"; "tall building" occurs 1 time, "road" occurs 2 times, and "vehicle" occurs 1 time, so the total number of occurrences of all keywords corresponding to the non-natural environment is 4. It can be seen that the total number of occurrences of all keywords corresponding to the natural environment is greater than that of the non-natural environment, so the natural environment is taken as the environment information corresponding to the text segment.
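The same counting scheme can be applied per dimension; the sketch below returns one value for each of the atmosphere, weather, and environment dimensions. The keyword sets are illustrative only.

```python
# Illustrative keyword sets per dimension of context information.
DIMENSION_KEYWORDS = {
    "atmosphere":  {"sad": ["tear", "cry"], "happy": ["smile"], "angry": ["loud", "roar"]},
    "weather":     {"sunny": ["sun", "moon"], "cloudy": ["cloud"], "rainy": ["rain"]},
    "environment": {"natural": ["bird", "grass", "willow"],
                    "non_natural": ["building", "road", "vehicle"]},
}

def determine_context_by_dimension(segment_text):
    """Return one value per dimension, chosen by the largest total keyword count."""
    words = segment_text.lower().split()
    result = {}
    for dimension, value_keywords in DIMENSION_KEYWORDS.items():
        totals = {value: sum(words.count(w) for w in kws)
                  for value, kws in value_keywords.items()}
        best = max(totals, key=totals.get)
        result[dimension] = best if totals[best] > 0 else None
    return result
```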
In a third way, information of at least one dimension included in the context information of the text segment is determined based on the machine learning model.
Wherein the context information may comprise information of at least one dimension, such as atmosphere, weather, environment, etc.
Because the context information of the text segment can include information of at least one dimension, machine learning models corresponding to different dimensions are established, and then the information of at least one dimension included in the context information of the text segment is determined. Determining the atmosphere corresponding to the text segment, for example, based on the trained first semantic model and any text segment; determining weather corresponding to the text segment based on the trained second semantic model and any text segment; and determining the environment corresponding to the text segment and the like based on the trained third semantic model and any text segment.
Taking the example of determining the atmosphere corresponding to the text segment based on the trained first semantic model and any text segment, inputting the text segment into the trained first semantic model, and outputting the atmosphere corresponding to the text segment.
Further, the specific steps of obtaining the trained first semantic model are as follows: a sample text library comprising a large number of text fragments with different atmospheres is established; any text fragment in the sample text library is obtained, the text fragment is input into the first semantic model, and the atmosphere corresponding to the text fragment is output. Based on the technician's understanding of the text fragment, the standard atmosphere corresponding to the text fragment is determined. The output atmosphere is compared with the standard atmosphere, and if they are the same, one training process is completed. If the output atmosphere is different from the standard atmosphere, parameters in the first semantic model are adjusted based on the difference information between the output atmosphere and the standard atmosphere, and one training process is completed. The training process is repeated a preset number of times to obtain the trained first semantic model.
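The application does not specify a model architecture. Purely as an illustration, the sketch below stands in for the "first semantic model" with a TF-IDF plus logistic regression text classifier from scikit-learn, trained on a tiny made-up sample text library with technician-given standard atmospheres; the second and third semantic models (weather and environment) could be built the same way on their own labelled samples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative sample text library; a real library would hold many labelled fragments.
sample_texts = [
    "tears ran down her face as she cried quietly",
    "everyone laughed and smiled at the party",
    "he roared and slammed the door in anger",
]
standard_atmospheres = ["sad", "happy", "angry"]   # labels given by the technician

# Stand-in for the "first semantic model": text features plus a simple classifier.
first_semantic_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
first_semantic_model.fit(sample_texts, standard_atmospheres)

# Inference: input a text segment, output the atmosphere corresponding to it.
predicted_atmosphere = first_semantic_model.predict(["she wiped away a tear and sobbed"])[0]
print(predicted_atmosphere)
```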
The training mode of the second semantic model and the third semantic model is similar to the training mode of the first semantic model, and is not discussed here.
In a fourth way, context information of a text segment is determined by the skilled person's understanding of the text segment.
It should be noted that, as a fifth way, the technician can correct the context information obtained in any of the first four ways.
Step 103, converting each text segment into a human voice audio segment respectively, based on the at least one type of human voice audio attribute information corresponding to each text segment.
The human voice audio type, speech rate, intonation, and volume corresponding to each text segment are determined according to the at least one type of human voice audio attribute information corresponding to that text segment, and each text segment is then converted into a human voice audio segment based on its corresponding human voice audio type, speech rate, intonation, and volume.
The human voice audio types may include a loli (young girl) voice, a young lady voice, a mature male voice, a sweet female voice, and the like.
Further, for the dialog-type text portion in which the little girl says "Mom, I'm hungry", the human voice audio type of the text segment is determined to be the loli voice, together with the speech rate, intonation, and volume corresponding to the text segment; "Mom, I'm hungry" is then converted into a human voice audio segment based on the loli voice and the speech rate, intonation, and volume corresponding to the text segment.
For the non-dialog-type text portion, the human voice audio type of the text segment "the earth gradually awakens, tender grass drills out of the soil, and a hazy green spreads through the meadow" is determined to be the sweet female voice, together with the speech rate, intonation, and volume corresponding to the text segment; the segment is then converted into a human voice audio segment based on the sweet female voice and the speech rate, intonation, and volume corresponding to the text segment.
Taking fig. 2 as an example, according to the context information corresponding to text segment 1, the human voice audio type corresponding to text segment 1 is determined to be a sweet female voice, together with the speech rate, intonation, and volume corresponding to text segment 1. According to the context information corresponding to text segment 2, the human voice audio type corresponding to text segment 2 is determined to be a mature male voice, together with the speech rate, intonation, and volume corresponding to text segment 2. According to the context information corresponding to text segment 3, the human voice audio type corresponding to text segment 3 is determined to be a loli voice, together with the speech rate, intonation, and volume corresponding to text segment 3.
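The application does not name a specific speech synthesis engine. Purely as a hypothetical illustration, the following sketch uses the open-source pyttsx3 library; the mapping of the 0-100 attribute values onto pyttsx3's rate and volume units and the selection of the voice type by index are assumptions, and intonation is omitted because not every pyttsx3 backend exposes pitch.

```python
import pyttsx3

def synthesize_segment(text, voice_index=0, rate_value=50, volume_value=50,
                       out_path="segment.wav"):
    """Convert one text segment into a human voice audio file using its attribute values."""
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")                # installed system voices
    engine.setProperty("voice", voices[voice_index].id)  # stand-in for the voice audio type
    engine.setProperty("rate", 100 + rate_value)         # map 0-100 onto words per minute
    engine.setProperty("volume", volume_value / 100.0)   # pyttsx3 expects 0.0-1.0
    engine.save_to_file(text, out_path)
    engine.runAndWait()
    return out_path

# Example: a low-mood segment spoken slowly and quietly (rounded stage midpoints from above).
synthesize_segment("Mom, I'm hungry", voice_index=0, rate_value=13, volume_value=13)
```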
Optionally, the music audio corresponding to each text segment is obtained based on the context information corresponding to each text segment and the pre-stored correspondence between context information and music audio; the music audio corresponding to each text segment is synthesized with the human voice audio segment corresponding to that text segment to obtain the human voice audio segment, with music audio added, corresponding to each text segment; and the human voice audio, with music audio added, corresponding to the target text is synthesized from these segments according to the arrangement order of the text segments in the target text.
Further, for the dialog-type text portion in which the little girl says "Mom, I'm hungry", the context of the text segment is determined to be the low-mood context, and the music audio corresponding to the low-mood context is determined, for example the piece "Who Is Hungry"; the music audio of "Who Is Hungry" is then synthesized with the human voice audio segment of "Mom, I'm hungry" to obtain the human voice audio with music audio added.
For the non-dialog-type text portion, the text segment "the earth gradually awakens, tender grass drills out of the soil, and a hazy green spreads through the meadow" is determined to correspond to a relaxed context, and the music audio corresponding to that context is determined, for example the piano piece "I Love You"; this music audio is then synthesized with the human voice audio segment of the text segment to obtain the human voice audio with music audio added.
Here, the technician presets "Who Is Hungry" to correspond to the low-mood context and "I Love You" to correspond to the relaxed context.
It should be noted that the music segments are added to accentuate the atmosphere of the text segments, so that the synthesized human voice audio of the target text is more vivid and interesting and the user experience is improved.
Optionally, for each text segment, determining a voice duration corresponding to a voice audio segment corresponding to the text segment, and determining a music audio segment with duration equal to the voice duration based on a music audio corresponding to the text segment; and superposing and synthesizing the music audio segments corresponding to the text segments and the voice audio segments to obtain the voice audio segments which are added with the music audio and correspond to the text segments.
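As a sketch of trimming the music audio to the voice duration and superposing the two, the following uses the pydub library as an illustrative choice; the attenuation applied to the music and the looping of short music clips are assumptions for the example.

```python
from pydub import AudioSegment

def add_background_music(voice_path, music_path, out_path, music_gain_db=-12):
    """Overlay a music clip of equal duration onto a human voice audio segment."""
    voice = AudioSegment.from_file(voice_path)
    music = AudioSegment.from_file(music_path)
    if len(music) < len(voice):                  # loop the music if it is shorter than the voice
        music = music * (len(voice) // len(music) + 1)
    clip = music[:len(voice)]                    # duration equal to the voice duration (ms)
    mixed = voice.overlay(clip + music_gain_db)  # superpose voice and attenuated music
    mixed.export(out_path, format="wav")
    return out_path
```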
Step 104, synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the arrangement order of the text segments in the target text.
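A short sketch of this splicing step, again using pydub as an illustrative choice; the list of per-segment files is assumed to already be in the order of the text segments in the target text.

```python
from pydub import AudioSegment

def concatenate_segments(segment_paths, out_path="target_text.wav"):
    """Splice the per-segment voice audio files in their original text order."""
    full_audio = AudioSegment.empty()
    for path in segment_paths:
        full_audio += AudioSegment.from_file(path)   # '+' concatenates AudioSegments
    full_audio.export(out_path, format="wav")
    return out_path
```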
According to the method and the device of the present application, the target text is divided into text segments, the context information corresponding to each text segment is determined, at least one type of human voice audio attribute corresponding to each text segment is further determined, each text segment is converted into a human voice audio segment, and the human voice audio segments are synthesized to obtain the human voice audio corresponding to the target text, so that the converted human voice audio can change correspondingly as the context changes and the speech conversion is more flexible.
Based on the same technical concept, an embodiment of the present application further provides an apparatus for text-to-speech conversion, as shown in fig. 3, the apparatus includes:
a first determination module 310 configured to determine at least one text segment included in the target text;
the second determining module 320 is configured to obtain context information corresponding to each text segment, and determine at least one type of vocal audio attribute information corresponding to each text segment based on the context information corresponding to each text segment;
the conversion module 330 is configured to convert each text segment into a human voice audio segment respectively, based on the at least one type of human voice audio attribute information corresponding to each text segment;
and the synthesis module 340 is configured to synthesize the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text.
Optionally, the first determining module 310 is configured to:
if the target text comprises a text part of the conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment;
and if the target text comprises the text part of the non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
Optionally, the at least one type of human voice audio attribute information includes:
at least one of a voice audio type, a speech rate, a tone, and a volume.
Optionally, the apparatus further includes an adding module configured to:
obtaining music audio corresponding to each text segment based on the context information corresponding to each text segment and the corresponding relation between the pre-stored context information and the music audio;
synthesizing the music audio corresponding to each text segment with the voice audio segment corresponding to each text segment to obtain the voice audio segment which is added with the music audio and corresponds to each text segment;
a synthesis module 340 configured to:
and synthesizing the human voice audio, with music audio added, corresponding to the target text from the music-added human voice audio segments corresponding to the text segments according to the arrangement order of the text segments in the target text.
Optionally, the adding module is configured to:
for each text segment, determining the voice duration corresponding to the voice audio segment corresponding to the text segment, and determining the music audio segment with the duration equal to the voice duration based on the music audio corresponding to the text segment;
and superposing and synthesizing the music audio segments corresponding to the text segments and the voice audio segments to obtain the voice audio segments which are added with the music audio and correspond to the text segments.
It should be noted that: in the text-to-speech conversion apparatus provided in the foregoing embodiment, only the division of the functional modules is illustrated in the text-to-speech conversion, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the functions described above. In addition, the text-to-speech conversion apparatus provided in the above embodiment and the text-to-speech conversion method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 4 shows a block diagram of a terminal 400 according to an exemplary embodiment of the present application. The terminal 400 may be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal 400 may also be referred to as a user equipment, a portable terminal, a laptop terminal, a desktop terminal, or by other names.
Generally, the terminal 400 includes: a processor 401 and a memory 402.
Processor 401 may include one or more processing cores, such as a 4-core processor or an 8-core processor. Processor 401 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 401 may also include a main processor and a coprocessor: the main processor is a processor for processing data in the wake-up state, also known as a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, processor 401 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering the content that the display screen needs to display. In some embodiments, processor 401 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
Memory 402 may include one or more computer-readable storage media, which may be non-transitory. Memory 402 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 402 is used to store at least one instruction for execution by processor 401 to implement the method of text-to-speech conversion provided by method embodiments herein.
In some embodiments, the terminal 400 may further optionally include: a peripheral interface 403 and at least one peripheral. The processor 401, memory 402 and peripheral interface 403 may be connected by bus or signal lines. Each peripheral may be connected to the peripheral interface 403 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 404, touch screen display 405, camera 406, audio circuitry 407, positioning components 408, and power supply 409.
The peripheral interface 403 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 401 and the memory 402. In some embodiments, processor 401, memory 402, and peripheral interface 403 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 401, the memory 402 and the peripheral interface 403 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The Radio Frequency circuit 404 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 404 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 404 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 404 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 404 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 404 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 405 may be used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 405 is a touch screen, the display screen 405 also has the ability to capture touch signals on or over its surface. The touch signals may be input to the processor 401 as control signals for processing, and the display screen 405 may then also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 405, arranged on the front panel of the terminal 400; in other embodiments, there may be at least two display screens 405, each disposed on a different surface of the terminal 400 or in a folded design; in still other embodiments, the display screen 405 may be a flexible display screen disposed on a curved or folded surface of the terminal 400. The display screen 405 may even be provided in a non-rectangular, irregular shape. The display screen 405 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 406 is used to capture images or video. Optionally, camera assembly 406 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 406 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuit 407 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 401 for processing, or inputting the electric signals to the radio frequency circuit 404 for realizing voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 400. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 401 or the radio frequency circuit 404 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 407 may also include a headphone jack.
The positioning component 408 is used to locate the current geographic location of the terminal 400 to implement navigation or LBS (Location Based Service). The positioning component 408 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 409 is used to supply power to the various components in the terminal 400. The power source 409 may be alternating current, direct current, disposable or rechargeable. When power source 409 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 400 also includes one or more sensors 410. The one or more sensors 410 include, but are not limited to: acceleration sensor 411, gyro sensor 412, pressure sensor 413, fingerprint sensor 414, optical sensor 415, and proximity sensor 416.
The acceleration sensor 411 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 400. For example, the acceleration sensor 411 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 401 may control the touch display screen 405 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 411. The acceleration sensor 411 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 412 may detect a body direction and a rotation angle of the terminal 400, and the gyro sensor 412 may cooperate with the acceleration sensor 411 to acquire a 3D motion of the terminal 400 by the user. From the data collected by the gyro sensor 412, the processor 401 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 413 may be disposed on a side bezel of the terminal 400 and/or beneath the touch display screen 405. When the pressure sensor 413 is disposed on the side bezel of the terminal 400, it can detect the user's grip signal on the terminal 400, and the processor 401 performs left/right-hand recognition or shortcut operations according to the grip signal collected by the pressure sensor 413. When the pressure sensor 413 is disposed beneath the touch display screen 405, the processor 401 controls operable controls on the UI according to the user's pressure operation on the touch display screen 405. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 414 is used to collect the user's fingerprint. The processor 401 identifies the user's identity from the fingerprint collected by the fingerprint sensor 414, or the fingerprint sensor 414 identifies the user's identity from the collected fingerprint. When the identity is recognized as trusted, the processor 401 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and the like. The fingerprint sensor 414 may be disposed on the front, back, or side of the terminal 400. When a physical button or a vendor Logo is provided on the terminal 400, the fingerprint sensor 414 may be integrated with the physical button or the vendor Logo.
The optical sensor 415 is used to collect the ambient light intensity. In one embodiment, the processor 401 may control the display brightness of the touch display screen 405 based on the ambient light intensity collected by the optical sensor 415. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 405 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 405 is turned down. In another embodiment, the processor 401 may also dynamically adjust the shooting parameters of the camera assembly 406 according to the ambient light intensity collected by the optical sensor 415.
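For illustration only, here is a minimal sketch of the brightness adjustment described above, assuming a simple linear mapping from ambient light intensity to a brightness level; the 0-1000 lux range and the linear mapping are assumptions, not values from the disclosure.

```python
def adjust_brightness(ambient_lux: float, min_level: int = 10, max_level: int = 255) -> int:
    """Map ambient light intensity (in lux) to a display brightness level:
    brighter surroundings give a brighter screen, darker surroundings a
    dimmer one. The 0-1000 lux range and linear mapping are assumed values."""
    clamped = max(0.0, min(ambient_lux, 1000.0))
    return round(min_level + (max_level - min_level) * clamped / 1000.0)


print(adjust_brightness(50.0))   # dim room   -> low brightness level
print(adjust_brightness(800.0))  # bright day -> high brightness level
```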
A proximity sensor 416, also known as a distance sensor, is typically disposed on the front panel of the terminal 400. The proximity sensor 416 is used to collect the distance between the user and the front surface of the terminal 400. In one embodiment, when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually decreases, the processor 401 controls the touch display screen 405 to switch from the screen-on state to the screen-off state; when the proximity sensor 416 detects that the distance between the user and the front surface of the terminal 400 gradually increases, the processor 401 controls the touch display screen 405 to switch from the screen-off state to the screen-on state.
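Again for illustration only, a minimal sketch of the screen-state switching described above; the two distance thresholds and the hysteresis behaviour are assumptions introduced here, not values from the disclosure.

```python
def screen_state(distance_cm: float, current: str, near_cm: float = 3.0, far_cm: float = 6.0) -> str:
    """Return "off" when the user is very close to the front panel (e.g. during
    a call), "on" when the user has moved away again, and otherwise keep the
    current state. The two thresholds form a small hysteresis band so the
    screen does not flicker around a single cut-off; both values are assumed."""
    if distance_cm <= near_cm:
        return "off"
    if distance_cm >= far_cm:
        return "on"
    return current


print(screen_state(2.0, current="on"))   # user approaches -> "off"
print(screen_state(8.0, current="off"))  # user moves away -> "on"
```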
Those skilled in the art will appreciate that the configuration shown in fig. 4 is not intended to be limiting of terminal 400 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 5 is a schematic structural diagram of a server according to an embodiment of the present application, where the server 500 may include one or more processors (CPUs) 501 and one or more memories 502, where at least one instruction is stored in the memory 502, and is loaded and executed by the processors 501 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor in a terminal to perform the method of text-to-speech in the above-described embodiments. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of text to speech conversion, the method comprising:
determining at least one text segment included in the target text;
acquiring context information corresponding to each text segment, and determining at least one type of human voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment;
respectively converting each text segment into a human voice audio segment based on the at least one type of human voice audio attribute information corresponding to that text segment;
and synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the order in which the text segments are arranged in the target text.
2. The method of claim 1, wherein determining the at least one text segment included in the target text comprises:
if the target text comprises a text part of a conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment;
and if the target text comprises a text part of a non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
3. The method of claim 1, wherein the at least one type of human voice audio attribute information comprises:
at least one of a human voice audio type, a speech rate, a tone, and a volume.
4. The method according to claim 1, wherein, after converting each text segment into a human voice audio segment, the method further comprises:
obtaining music audio corresponding to each text segment based on the context information corresponding to each text segment and a pre-stored correspondence between context information and music audio;
synthesizing the music audio corresponding to each text segment with the human voice audio segment corresponding to each text segment, to obtain, for each text segment, a human voice audio segment with the music audio added;
wherein synthesizing the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the order of the text segments in the target text, comprises:
synthesizing the human voice audio with the music audio added that corresponds to the target text from the human voice audio segments with the music audio added that correspond to the text segments, according to the order of the text segments in the target text.
5. The method according to claim 4, wherein synthesizing the music audio corresponding to each text segment with the human voice audio segment corresponding to each text segment, to obtain, for each text segment, a human voice audio segment with the music audio added, comprises:
for each text segment, determining the voice duration of the human voice audio segment corresponding to the text segment, and determining, from the music audio corresponding to the text segment, a music audio segment whose duration is equal to the voice duration;
and superimposing the music audio segment corresponding to each text segment onto the human voice audio segment corresponding to that text segment, to obtain, for each text segment, a human voice audio segment with the music audio added.
6. An apparatus for text to speech conversion, the apparatus comprising:
a first determination module configured to determine at least one text segment included in the target text;
a second determining module configured to acquire context information corresponding to each text segment, and to determine at least one type of human voice audio attribute information corresponding to each text segment based on the context information corresponding to each text segment;
a conversion module configured to convert each text segment into a human voice audio segment based on the at least one type of human voice audio attribute information corresponding to that text segment;
and a synthesis module configured to synthesize the human voice audio corresponding to the target text from the human voice audio segments corresponding to the text segments, according to the order in which the text segments are arranged in the target text.
7. The apparatus of claim 6, wherein the first determining module is configured to:
if the target text comprises a text part of a conversation type, dividing the text part of the conversation type based on the switching of conversation roles to obtain at least one text fragment;
and if the target text comprises a text part of a non-dialog type, dividing the text part of the non-dialog type based on chapters or paragraphs to obtain at least one text fragment.
8. The apparatus of claim 6, wherein the at least one type of human voice audio attribute information comprises:
at least one of a human voice audio type, a speech rate, a tone, and a volume.
9. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to perform operations performed by the method of text to speech conversion according to any of claims 1 to 5.
10. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform operations performed by a method of text to speech conversion according to any one of claims 1 to 5.
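For readers who prefer code to claim language, the following Python sketch illustrates the overall flow of method claims 1 to 5: splitting the target text into segments, deriving human voice audio attribute information from each segment's context, converting each segment into a human voice audio segment with those attributes, optionally matching a music audio segment to the voice duration and superimposing it (claims 4 and 5), and concatenating the results in the order of the segments. Every function, class, threshold, and rule in the sketch is a hypothetical placeholder; the claims do not prescribe any particular segmentation rule, TTS engine, or audio representation.

```python
from dataclasses import dataclass, field

SAMPLE_RATE = 16000  # assumed sample rate for the toy audio representation


@dataclass
class VoiceAttributes:
    """Human voice audio attribute information (claim 3); values are assumed."""
    voice_type: str = "narrator"
    speech_rate: float = 1.0
    tone: str = "neutral"
    volume: float = 1.0


@dataclass
class AudioClip:
    """A toy stand-in for an audio buffer; only length/duration matter here."""
    samples: list = field(default_factory=list)

    @property
    def duration(self) -> float:
        return len(self.samples) / SAMPLE_RATE


def split_into_segments(target_text):
    # Claim 2: dialogue parts would be split on changes of speaking role and
    # non-dialogue parts on chapters or paragraphs; a plain paragraph split
    # is used here as a simplifying assumption.
    return [p.strip() for p in target_text.split("\n\n") if p.strip()]


def attributes_from_context(context):
    # Hypothetical rule: exclamatory context gets a livelier, louder voice.
    if "!" in context:
        return VoiceAttributes(voice_type="young", speech_rate=1.2, tone="excited", volume=1.2)
    return VoiceAttributes()


def synthesize_speech(segment, attrs):
    # Placeholder for a real TTS engine: emits silence whose length grows with
    # the text and shrinks with the speech rate, so durations stay visible.
    n = int(0.06 * len(segment) * SAMPLE_RATE / attrs.speech_rate)
    return AudioClip(samples=[0.0] * n)


def load_music(context):
    # Placeholder for the pre-stored context -> music correspondence (claim 4).
    return AudioClip(samples=[0.0] * (30 * SAMPLE_RATE))  # 30 s of assumed music


def overlay(voice, music):
    # Claim 5: take a music segment equal in duration to the voice segment,
    # then superimpose the two.
    trimmed = music.samples[: len(voice.samples)]
    mixed = [v + 0.3 * m for v, m in zip(voice.samples, trimmed)]
    return AudioClip(samples=mixed)


def text_to_speech(target_text, with_music=True):
    segments = split_into_segments(target_text)        # claim 1, step 1
    clips = []
    for seg in segments:
        attrs = attributes_from_context(seg)            # claim 1, step 2 (segment text used as its own context here)
        voice = synthesize_speech(seg, attrs)           # claim 1, step 3
        if with_music:
            voice = overlay(voice, load_music(seg))     # claims 4 and 5
        clips.append(voice)
    result = AudioClip()                                # claim 1, step 4:
    for clip in clips:                                  # concatenate in the
        result.samples.extend(clip.samples)             # segments' order
    return result


audio = text_to_speech("Chapter one opens quietly.\n\nRun! she shouted.")
print(f"{audio.duration:.2f} s of synthesized audio")
```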
CN202010220555.2A 2020-03-25 2020-03-25 Text-to-speech method, device, equipment and storage medium Pending CN111415650A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010220555.2A CN111415650A (en) 2020-03-25 2020-03-25 Text-to-speech method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111415650A true CN111415650A (en) 2020-07-14

Family

ID=71493203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010220555.2A Pending CN111415650A (en) 2020-03-25 2020-03-25 Text-to-speech method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111415650A (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000089777A (en) * 1998-09-10 2000-03-31 Ricoh Co Ltd Document read-aloud device
US20110046943A1 (en) * 2009-08-19 2011-02-24 Samsung Electronics Co., Ltd. Method and apparatus for processing data
US8972265B1 (en) * 2012-06-18 2015-03-03 Audible, Inc. Multiple voices in audio content
CN107731219A (en) * 2017-09-06 2018-02-23 百度在线网络技术(北京)有限公司 Phonetic synthesis processing method, device and equipment
CN108091321A (en) * 2017-11-06 2018-05-29 芋头科技(杭州)有限公司 A kind of phoneme synthesizing method
CN110491365A (en) * 2018-05-10 2019-11-22 微软技术许可有限责任公司 Audio is generated for plain text document
CN108962219A (en) * 2018-06-29 2018-12-07 百度在线网络技术(北京)有限公司 Method and apparatus for handling text
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109065019A (en) * 2018-08-27 2018-12-21 北京光年无限科技有限公司 A kind of narration data processing method and system towards intelligent robot
CN109658916A (en) * 2018-12-19 2019-04-19 腾讯科技(深圳)有限公司 Phoneme synthesizing method, device, storage medium and computer equipment
CN109523986A (en) * 2018-12-20 2019-03-26 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and storage medium
CN109616094A (en) * 2018-12-29 2019-04-12 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, system and storage medium
CN109658917A (en) * 2019-01-17 2019-04-19 深圳壹账通智能科技有限公司 E-book chants method, apparatus, computer equipment and storage medium
CN110634336A (en) * 2019-08-22 2019-12-31 北京达佳互联信息技术有限公司 Method and device for generating audio electronic book

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112672207A (en) * 2020-12-30 2021-04-16 广州繁星互娱信息科技有限公司 Audio data processing method and device, computer equipment and storage medium
CN113010138A (en) * 2021-03-04 2021-06-22 腾讯科技(深圳)有限公司 Article voice playing method, device and equipment and computer readable storage medium
CN113763918A (en) * 2021-08-18 2021-12-07 单百通 Text-to-speech conversion method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN109147757B (en) Singing voice synthesis method and device
CN108538302B (en) Method and apparatus for synthesizing audio
CN111415650A (en) Text-to-speech method, device, equipment and storage medium
CN111031386B (en) Video dubbing method and device based on voice synthesis, computer equipment and medium
US20230252964A1 (en) Method and apparatus for determining volume adjustment ratio information, device, and storage medium
CN109192218B (en) Method and apparatus for audio processing
CN110956971B (en) Audio processing method, device, terminal and storage medium
WO2022111168A1 (en) Video classification method and apparatus
CN111564152A (en) Voice conversion method and device, electronic equipment and storage medium
CN111524501A (en) Voice playing method and device, computer equipment and computer readable storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN111276122A (en) Audio generation method and device and storage medium
CN113420177A (en) Audio data processing method and device, computer equipment and storage medium
CN110867194B (en) Audio scoring method, device, equipment and storage medium
CN110798327B (en) Message processing method, device and storage medium
CN111048109A (en) Acoustic feature determination method and apparatus, computer device, and storage medium
CN111223475A (en) Voice data generation method and device, electronic equipment and storage medium
CN113963707A (en) Audio processing method, device, equipment and storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN110837557B (en) Abstract generation method, device, equipment and medium
CN111428079A (en) Text content processing method and device, computer equipment and storage medium
CN111063372B (en) Method, device and equipment for determining pitch characteristics and storage medium
CN111145723B (en) Method, device, equipment and storage medium for converting audio
CN112560903A (en) Method, device and equipment for determining image aesthetic information and storage medium
CN111916105A (en) Voice signal processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination