WO2020024582A1 - Speech synthesis method and related device - Google Patents

Speech synthesis method and related device

Info

Publication number
WO2020024582A1
WO2020024582A1 (application PCT/CN2019/076552)
Authority
WO
WIPO (PCT)
Prior art keywords
user
preset
acoustic model
speech synthesis
voice
Prior art date
Application number
PCT/CN2019/076552
Other languages
French (fr)
Chinese (zh)
Inventor
包飞
邓利群
孙文华
曾毓珑
魏建生
胡月志
黄茂胜
黄雪妍
李志刚
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020024582A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • the invention relates to the field of speech processing, in particular to a speech synthesis method and related equipment.
  • human-computer dialogue has begun to widely enter people's daily life. Common scenarios include intelligent customer service robots, smart speakers, chat robots, and so on.
  • the core of human-computer dialogue is that, within the framework of the built system and based on trained or learned data, the machine can automatically understand and analyze the voice input by the user and give a meaningful voice response.
  • in a speech synthesis system for Chinese text, if the input text is simply matched character by character against a pronunciation database and the pronunciations of all characters are concatenated to form the speech output, the resulting speech is mechanically stiff and lacks inflection, giving a poor listening experience.
  • the TTS (text-to-speech) engine developed in recent years is a speech synthesis technology based on reading rules. Using a TTS engine for speech synthesis can handle natural transitions between single characters or words and changes in tone, making the machine's spoken response closer to a human voice.
  • however, such a machine is limited to "speaking like a human" during human-computer interaction and does not consider users' diverse needs for that interaction.
  • the embodiments of the present invention provide a speech synthesis method and related equipment, so that the machine can provide a personalized speech synthesis effect for the user according to user preferences or the requirements of the dialogue environment during human-machine interaction, improve the timeliness of human-machine dialogue, and enhance the user's voice interaction experience.
  • an embodiment of the present invention provides a speech synthesis method that can be applied to a terminal device, including: the terminal device receives a user's current input voice, and determines the identity of the user according to the user's current input voice;
  • an acoustic model is obtained, according to the current input voice, from an acoustic model library preset in the terminal device.
  • the preset information of the acoustic model includes two or more of a preset sound speed, a preset volume, a preset pitch, a preset tone color, a preset intonation, and a preset prosody rhythm; the terminal device determines basic speech synthesis information according to the identity of the user, and the identity of the user is associated with corresponding basic speech synthesis information.
  • the basic speech synthesis information may also be referred to as basic TTS parameters.
  • the basic TTS parameters are used to characterize a change in one or more of the preset sound speed, preset volume, and preset pitch of the acoustic model used in speech synthesis.
  • the terminal device determines a reply text according to the current input voice; the terminal device determines enhanced speech synthesis information according to the reply text, or according to the reply text and context information.
  • the enhanced speech synthesis information described in the embodiments of the present invention may also be referred to as enhanced TTS parameters.
  • the enhanced TTS parameters are used to characterize the amount of change in one or more of the preset tone color, preset intonation, and preset prosody rhythm. In the embodiment of the present invention, the terminal device can determine the dialogue scene of the current dialogue according to the reply text, or according to the reply text and the context information of the current input voice. The terminal device then performs speech synthesis on the reply text through the acoustic model (including the preset information of the acoustic model) according to the basic speech synthesis information and the enhanced speech synthesis information, obtaining the reply voice and thus realizing real-time dialogue interaction between the terminal device and the user. That is, in the embodiment of the present invention, the acoustic model converts the reply text into the reply voice according to its preset information and the change information of that preset information.
  • the acoustic model library may include multiple acoustic models (for example, a general acoustic model, a personalized acoustic model, etc.). These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information; that is, each acoustic model is bound to specific preset information, and this preset information can be used as the basic input information of the acoustic model.
  • the terminal may also determine the basic speech synthesis information according to the personal preference of the user.
  • the context information may represent a context of a current input voice or a historical input voice before the current input voice.
  • in the human-machine voice interaction between the user and the terminal device, the terminal device generates a corresponding reply text based on the user's input voice on the one hand, and on the other hand, based on the reply text of the dialogue interaction and the dialogue context information, selects personalized TTS parameters (the TTS parameters include basic TTS parameters and enhanced TTS parameters) according to the current user's identity, preferences, and dialogue scene; the terminal device can then use the selected TTS parameters, through the selected acoustic model, to generate a reply voice in a specific style, thereby presenting a personalized speech synthesis effect to the user, greatly improving the user's voice interaction experience with the terminal, and improving the timeliness of human-machine dialogue.
  • the terminal device also allows the user to tune the terminal device in real time through voice and to update the TTS parameters associated with the user's identity and preferences, including updating the basic TTS parameters and the enhanced TTS parameters, so that the tuned terminal is closer to the user's interaction preferences and the user's interaction experience is maximized.
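The flow described above can be summarized in a short sketch. This is a minimal illustration only, not the patented implementation: the user IDs, parameter names, and values are invented, and a real system would drive a neural acoustic model rather than return a dictionary.

```python
# Hypothetical per-user basic TTS parameters: change amounts applied to the
# acoustic model's preset sound speed / volume / pitch.
BASIC_TTS_PARAMS = {
    "xiaoming": {"speed": 0.9, "volume": 1.1, "pitch": 1.0},
    "default":  {"speed": 1.0, "volume": 1.0, "pitch": 1.0},
}

# Hypothetical enhanced TTS parameters keyed by dialogue scene.
ENHANCED_TTS_PARAMS = {
    "poem_recitation": {"prosody_template": "jueju_5", "intonation": 1.2},
    "daily_dialogue":  {"prosody_template": None, "intonation": 1.0},
}

def plan_reply_synthesis(user_id, reply_text, scene):
    """Pick basic parameters by user identity and enhanced parameters by scene."""
    basic = BASIC_TTS_PARAMS.get(user_id, BASIC_TTS_PARAMS["default"])
    enhanced = ENHANCED_TTS_PARAMS.get(scene, ENHANCED_TTS_PARAMS["daily_dialogue"])
    # A real system would now call the selected acoustic model; here we return the plan.
    return {"text": reply_text, "basic": basic, "enhanced": enhanced}

print(plan_reply_synthesis("xiaoming", "床前明月光", "poem_recitation"))
```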
  • the enhanced TTS parameters may be further classified into speech emotion parameters and speech scene parameters.
  • the speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics. According to the emotional characteristics, the speech emotion parameters can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness.
  • the speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics.
  • the speech scene parameters can be further divided into parameters such as daily conversation, poetry recitation, song humming, storytelling, and news broadcast; that is to say, using these voice scene parameters in speech synthesis enables the synthesized speech to present the sound effects of voice scenes such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcast.
  • the manner of determining the current scene as a voice scene of "poem recitation" may include:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is a "poem recitation" voice scene;
  • the terminal device can still determine whether the content of the reply text involves a particular literary style such as shi (poetry), ci, qu, or fu;
  • for example, one or more types such as five-character quatrains, seven-character quatrains, or regulated verse, or specific ci or qu tune patterns;
  • the terminal device stores in advance literary style features such as the number of sentences, the number of words in each sentence, and the arrangement order of the number of words per sentence; by analyzing the punctuation (pauses), number of words, number of sentences, and order of the number of words per sentence in the reply text, it matches a paragraph or all of the text of the reply text against the pre-stored literary style features. If the match succeeds, the paragraph or all of the text that conforms to the pre-stored literary style feature can be used as text for the "poem recitation" voice scene.
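As one illustration of this matching, the sketch below checks whether the sentence lengths of the reply text fit a pre-stored literary-style pattern. The style names, patterns, and punctuation-based splitting are assumptions for the example, not the patent's actual feature set.

```python
import re

# Assumed pre-stored literary-style features: number of sentences and the
# character count of each sentence, in order.
LITERARY_STYLES = {
    "five_char_jueju":  [5, 5, 5, 5],   # 4 lines, 5 characters each
    "seven_char_jueju": [7, 7, 7, 7],   # 4 lines, 7 characters each
    "five_char_lvshi":  [5] * 8,        # 8 lines, 5 characters each
}

def detect_literary_style(reply_text):
    # Split on punctuation (pauses) and drop empty segments.
    lines = [s for s in re.split(r"[，。,.!？?；;\n]", reply_text) if s]
    pattern = [len(line) for line in lines]
    for style, expected in LITERARY_STYLES.items():
        if pattern == expected:
            return style        # match: use the "poem recitation" voice scene
    return None                 # no match: keep the normal dialogue scene

print(detect_literary_style("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
```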
  • the "poetry recitation" voice scene focuses on the rhythm of speech; the voice scene parameters of "poetry recitation" are used to adjust the speech pause position / pause time (that is, the word segmentation of the text content), the reading duration of single characters or words, and the stress positions of input text that conforms to a specific literary style (or syntax format), so as to strengthen the prosodic rhythm.
  • the enhanced prosodic rhythm has a clearer and stronger emotional expression; for example, when reading specific poems, nursery rhymes, and other specific syntactic formats, the enhanced prosodic rhythm can produce a cadenced, rising-and-falling effect.
  • the voice scene parameters of "poetry recitation" can be realized through prosodic rhythm templates, and each specific literary style of text content can correspond to a prosodic rhythm template.
  • the literary style characterizes the genre of the poetry; for example, the literary style may be ancient-style poetry, near-style poetry (such as five-character or seven-character quatrains), regulated verse (such as five-character or seven-character regulated verse), ci (such as short, medium, or long ci forms), or qu (including various tunes, tune patterns, etc.). Each prosodic rhythm template defines the volume change of the word at each position in the template (that is, the stress of the word), the change of its sound length (that is, the length of time the word is pronounced), the pause position / pause time of the speech in the text (that is, the word segmentation of the text content), and so on.
  • the process of the terminal determining the enhanced speech synthesis information according to the reply text and context information specifically includes:
  • the literary style feature of the reply text is determined by analyzing the reply text, and the literary style feature includes one or more of the number of sentences, the number of words per sentence, and the arrangement order of the number of words per sentence in the reply text; the corresponding change amount of the preset prosody rhythm is then selected according to the literary style feature involved in the reply text.
  • the change amount of the preset prosody rhythm is the prosody rhythm template, and there is a corresponding relationship between the literary style feature and the prosody rhythm template.
  • the terminal performs rhythmic template alignment on the content of the reply text, so as to facilitate subsequent speech synthesis.
  • the terminal may align the relevant content in the reply text with the rhythmic template of the "poem recitation" voice scene.
  • the terminal may combine the pronunciations from the corresponding acoustic model library for the relevant content in the reply text with the parameters of the prosodic rhythm template, and superimpose the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
  • for example, suppose the prosody enhancement parameter is α (0 ≤ α ≤ 1).
  • the preset volume of the i-th word in the text content is Vi. If the prosodic rhythm feature of the word includes a stress feature whose stress change amount is E1, then the final volume of the word is Vi × (1 + E1) × (1 + α).
  • if the basic sound length of the i-th word in the text is Di and the change amount of its sound length is E2, then the final sound length of the word is Di × (1 + E2).
  • if a pause is required between the i-th word and the (i+1)-th word, the pause time is changed, for example, from 0 s to 0.02 s.
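The formulas above can be applied per word as in the following sketch; the word list, template values, and the value of α are made up for illustration.

```python
def apply_prosody_template(words, template, alpha=0.5):
    """words: dicts with preset 'volume' Vi and 'duration' Di.
    template: per-position dicts with stress change E1, duration change E2, pause (s)."""
    rendered = []
    for word, slot in zip(words, template):
        volume = word["volume"]
        if slot.get("stressed"):
            # Stressed word: Vi * (1 + E1) * (1 + alpha)
            volume *= (1 + slot["E1"]) * (1 + alpha)
        # Duration change: Di * (1 + E2)
        duration = word["duration"] * (1 + slot.get("E2", 0.0))
        rendered.append({"text": word["text"], "volume": volume,
                         "duration": duration, "pause_after": slot.get("pause", 0.0)})
    return rendered

words = [{"text": "床", "volume": 1.0, "duration": 0.25},
         {"text": "前", "volume": 1.0, "duration": 0.25}]
template = [{"stressed": True, "E1": 0.2, "E2": 0.1, "pause": 0.0},
            {"stressed": False, "E2": 0.3, "pause": 0.02}]
print(apply_prosody_template(words, template, alpha=0.5))
```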
  • the acoustic model library may include a general acoustic model and several personalized acoustic models, where:
  • the preset information of the general acoustic model may include the model's preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, preset prosody rhythm, and so on.
  • the speech synthesized by the general acoustic model presents the sound effect of a normal, general dialogue scenario.
  • the preset information of the personalized acoustic model may include voice characteristics and language style characteristics. That is, the preset information of the personalized acoustic model includes two or more of the preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, and preset prosody rhythm; in addition, it can also include other personalized information, such as one or more of the language style characteristics including pet phrases, response styles for specific scenes, wisdom type, personality type, mixing of popular expressions or dialects, and forms of address for specific characters.
  • the speech synthesized by the personalized acoustic model can "simulate" the sound effect of the dialogue scene.
  • the preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, preset prosody rhythm, and other preset information differ between different acoustic models.
  • the preset information of a personalized acoustic model may be significantly different from the preset information of the general acoustic model.
  • the following takes "character imitation" as an example to describe the implementation of an acoustic model related to "character imitation" in speech synthesis.
  • the terminal device may determine, through user input voice, that the current conversation needs to adopt an acoustic model of "character imitation", which specifically includes several methods:
  • after the terminal device determines the user's intention, it further determines that the current conversation is a "character imitation" scene. For example, if the user inputs a voice instructing the terminal to speak with Lin Zhiling's voice, then after the terminal recognizes the user's intention, it automatically sets the current dialogue scene as a "character imitation" scene.
  • the terminal device can still determine whether the content of the input text corresponding to the user's input voice involves the content of character imitation.
  • the reply content that can be imitated by a character can be determined by means of full-text matching, keyword matching, or semantic similarity matching; such content includes lyrics, sound effects, movie lines, and cartoon dialogue scripts.
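A minimal sketch of such matching, assuming a small pre-stored library and using only substring (full-text/keyword) matching; the semantic-similarity matching mentioned above is omitted, and the library entries are invented examples.

```python
# Assumed library of character-related content: lyrics, movie lines, cartoon keywords.
IMITATION_LIBRARY = {
    "zhou_xingchi": ["曾经有一份真诚的爱情放在我面前"],   # illustrative movie line
    "peppa_pig":    ["小猪佩奇", "乔治"],                 # illustrative cartoon keywords
}

def match_imitation_character(reply_text):
    for character, snippets in IMITATION_LIBRARY.items():
        for snippet in snippets:
            # Full-text / keyword match; a real system could fall back to
            # semantic-similarity matching when no literal match is found.
            if snippet in reply_text or reply_text in snippet:
                return character
    return None

print(match_imitation_character("小猪佩奇今天去踩泥坑啦"))
```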
  • the acoustic model library of the terminal device is preset with various acoustic models (that is, personalized acoustic models) for implementing "character imitation".
  • the acoustic model for "character imitation" can be used to make the synthesized speech have the sound characteristics of a specific character; therefore, the preset tone color, preset intonation, and preset prosody rhythm of a "character imitation" acoustic model will differ from the corresponding information of the general acoustic model.
  • the character imitated by a "character imitation" acoustic model may be a personal image the user likes, a character in a film or television work, or a combination of multiple preset sound modes and user preferences.
  • the "character imitation" acoustic model can be an acoustic model that imitates the user's own speaking style; it can also be an acoustic model that imitates the speaking characteristics of other characters, for example, an acoustic model that imitates "Lin Zhiling / soft voice", an acoustic model that imitates "Xiao Shenyang / funny voice", an acoustic model that imitates "Andy Lau / deep voice", and so on.
  • in some cases, the terminal does not select a specific acoustic model in the acoustic model library, but a comprehensive model (also referred to as a fusion model) of multiple acoustic models in the acoustic model library.
  • the terminal may obtain the acoustic model corresponding to "character imitation" from the acoustic model library, and the implementation methods may include the following:
  • the terminal device may select a certain acoustic model or a certain fusion model from the acoustic model library according to the identity of the user. Specifically, since the identity of the user may be associated with the preference of the user, the terminal device may determine the preference of the user according to the identity of the user, and then select a certain acoustic model or a certain fusion model from the acoustic model library according to the preference of the user.
  • the acoustic model preferred by the user is not necessarily the personalized acoustic model originally set in the acoustic model library, but may be an acoustic model obtained by fine-tuning parameters of a personalized acoustic model according to the preference of the user.
  • the sound characteristics of a personalized acoustic model originally set in the acoustic model library include a first speech speed (speed of sound), a first intonation, a first rhythm, and a first tone color.
  • the terminal determines the user's favorite parameter combination through analysis of user preferences or through manual settings by the user: 0.8 times the first speech speed, 1.3 times the first intonation, 0.9 times the first rhythm, and a 1.2-times feminized first tone color; these parameters are adjusted accordingly to obtain a personalized acoustic model that meets the user's needs.
  • the terminal device determines an acoustic mode identifier related to the content of the current input voice according to that content, and selects an acoustic model corresponding to the acoustic mode identifier from the acoustic model library. For example, the terminal may determine, based on the input text, user preference, or reply text, that the currently synthesized speech needs to use a "Zhou Xingchi (Stephen Chow)"-style voice, and then select an acoustic model of that voice type from the acoustic model library.
  • after the terminal device selects multiple acoustic models from the acoustic model library according to the identity of the user, it determines a weight value (that is, a preference coefficient) for each of the multiple acoustic models;
  • the weight values of the acoustic models are set in advance by the user, or the weight values of the respective acoustic models are determined in advance according to the preferences of the user; the respective acoustic models are then fused based on the weight values to obtain a fused acoustic model.
  • the terminal device can also directly match the sounds of multiple acoustic models according to the user's identity (that is, the user's preferences or needs are directly tied to the user's identity), thereby determining the user's preference coefficients for sound types such as deep, soft, cute, and funny; for example, the coefficients for three such models are 0.2, 0.8, and 0.5, that is, the weights of these acoustic models are 0.2, 0.8, and 0.5, respectively.
  • the final acoustic model (i.e., the fusion model), together with the synthesized speech scene parameters, realizes sound conversion of the acoustic model in terms of speech rate, intonation, rhythm, and timbre, which is conducive to producing mixed sound effects such as "talking Lin Zhiling" or "rapper-mode Lin Zhiling".
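As a simplified illustration of weight-based fusion, the sketch below only averages the models' preset parameters by their preference coefficients; fusing the neural acoustic models themselves would be considerably more involved, and the model names, parameters, and weights are assumptions.

```python
# Assumed preset parameters of three personalized acoustic models.
MODELS = {
    "soft_voice": {"speed": 0.9, "intonation": 1.2, "rhythm": 1.0, "timbre_id": 1},
    "cute_voice": {"speed": 1.1, "intonation": 1.3, "rhythm": 1.1, "timbre_id": 2},
    "deep_voice": {"speed": 0.8, "intonation": 0.8, "rhythm": 0.9, "timbre_id": 3},
}

def fuse_models(weights):
    """weights: {model_name: preference coefficient}, e.g. {"soft_voice": 0.8, ...}."""
    total = sum(weights.values())
    fused = {}
    for key in ("speed", "intonation", "rhythm"):
        # Continuous parameters: weighted average by preference coefficient.
        fused[key] = sum(MODELS[m][key] * w for m, w in weights.items()) / total
    # Discrete properties such as timbre: take the highest-weighted model.
    fused["timbre_id"] = MODELS[max(weights, key=weights.get)]["timbre_id"]
    return fused

print(fuse_models({"soft_voice": 0.8, "cute_voice": 0.5, "deep_voice": 0.2}))
```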
  • the TTS parameter further includes a correspondence between a target character and a user's preferred pronunciation.
  • the customized character pronunciation table includes a mapping relationship between a target character and a user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is used to enable the target character involved in the speech synthesized by the acoustic model to have the user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is associated with the identity of the user, that is, different mapping relationships can be organized according to the identity of the user.
  • the customized character pronunciation table can be organized and stored according to user identity.
  • the customized character pronunciation table corresponding to an unregistered user is empty, while the customized character pronunciation table corresponding to a registered user can have entries added, changed, deleted, and so on based on the user's preferences.
  • the object of the setting operation may be a character, a person or place name, a letter, a special symbol, or the like that is easily misread by the terminal or that the user cares about.
  • the customized character pronunciation table includes the mapping relationship between the target character (string) and the user's preferred pronunciation.
  • the target character (string) can be a character (a Chinese character or a foreign-language character), a word, a phrase, a sentence, a number, or a symbol (such as Chinese characters, foreign-language characters, emoticons, punctuation, special symbols, etc.).
  • the terminal device may determine the correspondence between the target character and the user's preferred pronunciation according to the historical input voice of the user, associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user, and write Enter the custom character pronunciation table.
  • for example, the terminal's original acoustic model reads "小猪佩奇" (Peppa Pig) as "xiao3 zhu1 pei4 qi2". If the user has tuned the terminal device through voice in advance and requested that the character "奇" in "小猪佩奇" be pronounced "ki1", the terminal device records the mapping between "小猪佩奇" and "xiao3 zhu1 pei4 ki1" and writes this mapping relationship into the customized character pronunciation table associated with the user "Xiaoming".
  • the terminal device may find the dialogue text output by the terminal in the last round or previous rounds of conversation from the context information, and determine the pronunciation of each word in that dialogue text (for example, using the acoustic model). For example, the output text of the terminal in the last round of conversation was "I'm glad to meet you, Xiao Qian (小茜)", and the terminal determined that its corresponding pronunciation was "hen3, gao1, xing4, ren4, shi2, ni3, xiao3, xi1". The DM module then matches the misread pronunciation against the pronunciation string of the output text and determines that the word corresponding to the misread pronunciation "xiao3 xi1" is "小茜", that is, "小茜" is the target term (the target character to be corrected). Furthermore, the terminal device adds the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target character-pronunciation pair to the customized character pronunciation table associated with the current user identity.
  • when the terminal device finds that a target character associated with the identity of the user exists in the reply text, it performs speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  • for example, when the reply text of the terminal device contains "Xiao Qian", the terminal device determines, according to the record in the customized character pronunciation table, that the pronunciation of "Xiao Qian" is "xiao3 qian4". In this way, the pronunciation of "Xiao Qian" in the reply speech obtained by speech synthesis through the acoustic model is "xiao3 qian4".
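A minimal sketch of how such a per-user pronunciation table might be applied before synthesis; the table entries mirror the examples above, while the toy grapheme-to-phoneme fallback and the function names are invented for illustration.

```python
# Assumed per-user customized character pronunciation tables.
PRONUNCIATION_TABLE = {
    "xiaoming":     {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},
    "current_user": {"小茜": "xiao3 qian4"},
}

def apply_preferred_pronunciations(user_id, reply_text, default_g2p):
    """default_g2p: fallback grapheme-to-phoneme function for unmapped text."""
    overrides = PRONUNCIATION_TABLE.get(user_id, {})
    pronunciation = default_g2p(reply_text)
    for target, preferred in overrides.items():
        if target in reply_text:
            # Swap the default reading of the target string for the preferred one.
            pronunciation = pronunciation.replace(default_g2p(target), preferred)
    return pronunciation

def toy_g2p(text):
    # Toy fallback G2P: not real pinyin, just a deterministic placeholder per character.
    return " ".join("xi1" if ch == "茜" else f"p{ord(ch) % 10}" for ch in text)

print(apply_preferred_pronunciations("current_user", "很高兴认识你小茜", toy_g2p))
```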
  • the TTS parameter further includes a background sound effect
  • the TTS parameter database may include a music library, the music library includes multiple music information, and the music information is used in a speech synthesis process.
  • the background sound effect specifically refers to a certain music segment (such as pure music or song) or sound special effects (such as movie sound effects, game sound effects, language sound effects, animation sound effects, etc.) in the music.
  • the background sound effect is used to superimpose different styles and rhythms of music or sound effects on the speech background synthesized by the acoustic model, thereby enhancing the expression effect of the synthesized speech (such as enhancing the emotional effect).
  • the following describes a method for synthesizing speech in an embodiment of the present invention by using a scene in which a synthesized speech is superimposed with a "background sound effect" as an example.
  • only when the terminal device determines that the reply text has content suitable for superimposing background music does it need to superimpose a background sound effect on the synthesized speech.
  • the terminal device may automatically determine content suitable for superimposing background music.
  • the content suitable for superimposing background music can be words with emotional polarity, can be poetry, can be film and television lines, and so on.
  • the terminal can identify the sentiment-oriented words in the sentence through the DM module, and then determine the emotional state of the phrase, sentence, or the entire reply text through methods such as grammatical rule analysis and machine learning classification. In this process, the emotional dictionary can be used to identify these emotionally inclined words.
  • the emotional dictionary is a collection of words, and the words in the collection have obvious emotional polarity tendencies, and the emotional dictionary also contains the polarity information of these words.
  • the words in the dictionary are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and other emotional polarity types.
  • different types of emotional polarity can be further divided into multiple levels of emotional intensity (for example, divided into five levels of emotional intensity).
  • after determining that there is content suitable for superimposing a background sound effect in the reply text, the terminal determines the background sound effect to be superimposed from the music library. Specifically, the terminal sets, in advance, an emotional polarity category identifier for different segments (i.e., sub-segments) of each music file in the music library; for example, these segments are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and so on. Assuming that the current reply text includes text with emotional polarity, after determining the emotional polarity category of these texts, the terminal device searches the music library for a music file with the corresponding emotional polarity category identifier.
  • further, if an emotional polarity category identifier and an emotional intensity identifier are set in advance for each sub-segment in the music library, then after the emotional polarity category and emotional intensity of these texts are determined, a combination of sub-segments with the corresponding emotional polarity category and emotional intensity identifiers is found in the music library as the finally selected background sound effect.
  • the terminal device selects the most matching background sound effect in the preset music library according to part or all of the reply text.
  • the terminal device can split the content in the reply text that needs to be superimposed with background sound effects into different parts (split according to punctuation); each part can be called a sub-content, and the emotional polarity type and emotional intensity of each sub-content are calculated.
  • align the content with the matched background sound effect so that the emotional change of the content is basically consistent with the emotional change of the background sound effect.
  • the best-matching background sound effect includes a plurality of sub-segments, each of which has an emotional polarity type identifier and an emotional intensity identifier; the emotional polarity type indicated by the identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  • the reply text is "The weather is good, the national football team has won again, so happy.”
  • the entire content of the reply text needs to be superimposed with background sound effects.
  • the reply text is split into the three sub-contents "The weather is good," "the national football team won the game again," and "so happy," and the emotional polarity category of each sub-content is happiness, with different emotional intensities.
  • a music file whose emotional polarity category is happiness is initially determined in the music library. Further, the emotional change trajectory of the music file can be calculated to obtain the emotional intensities of three sub-segments in the music.
  • the emotional change of this fragment is basically consistent with the emotional change trend of the three sub-contents of the reply text, so the music fragment composed of these three sub-segments in this music file is the background sound effect that matches the reply text. Therefore, the three sub-contents "The weather is good," "the national football team won again," and "so happy" in the reply text can be aligned with the three sub-segments.
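The matching logic can be sketched as follows, assuming each sub-content and each music sub-segment has already been reduced to an (emotion, intensity) pair; the library contents and intensity values are invented for illustration.

```python
# Assumed music library: per-file list of (emotion, intensity 1-5) sub-segments.
MUSIC_LIBRARY = {
    "happy_tune.mp3": [("happy", 2), ("happy", 4), ("happy", 5)],
    "sad_tune.mp3":   [("sadness", 2), ("sadness", 3), ("sadness", 4)],
}

def trend(values):
    # Direction of change between consecutive intensities: +1 rising, -1 falling, 0 flat.
    return [1 if b > a else (-1 if b < a else 0) for a, b in zip(values, values[1:])]

def pick_background(sub_contents):
    """sub_contents: list of (emotion, intensity) computed for each part of the reply."""
    emotions = [e for e, _ in sub_contents]
    wanted_trend = trend([i for _, i in sub_contents])
    for name, segments in MUSIC_LIBRARY.items():
        if len(segments) < len(sub_contents):
            continue
        head = segments[:len(sub_contents)]
        if [e for e, _ in head] == emotions and trend([i for _, i in head]) == wanted_trend:
            return name
    return None

# "The weather is good" / "the national football team won again" / "so happy"
print(pick_background([("happy", 2), ("happy", 4), ("happy", 5)]))
```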
  • finally, the terminal device uses the selected acoustic model to perform speech synthesis on the reply text according to the background sound effect (that is, the most-matched music fragment), the basic speech synthesis information, and the enhanced speech synthesis information, and the final reply speech output presents a "speech superimposed on background sound effect" effect.
  • the current dialogue scene may also be a "song humming" voice scene.
  • in this case, the enhanced speech synthesis information used by the terminal device in speech synthesis includes the "song humming (nursery-rhyme humming)" voice scene parameters.
  • the speech synthesis method of the embodiment of the present invention is described below by taking a speech scene of "song humming (taking a nursery rhyme as an example)" as an example.
  • in music, time is divided into equal basic units, and each basic unit is called a "beat".
  • the time value of a beat is expressed by the time value of a note.
  • the time value of a beat can be a quarter note (that is, a quarter note is one beat), a half note (a half note is one beat), or an eighth note (an eighth note is one beat).
  • the rhythm of music is generally defined by the time signature; for example, 4/4 time means that a quarter note is one beat and there are 4 beats per measure, so a measure can contain 4 quarter notes.
  • the voice scene parameters of "children's song humming" are the preset beat types of various children's songs and a method for segmenting the content of the reply text that needs to be synthesized in the "children's song humming" manner.
  • the terminal determines that the voice scene of the current conversation is the "children's song humming" voice scene through the reply text and context information.
  • the user's input voice contains a user's intention to clearly indicate that the current conversation is a "child song humming" voice scene.
  • the terminal can still determine whether the content of the reply text involves the content of children's songs through the DM module.
  • the DM module can search the local pre-stored nursery rhyme library or search the nursery rhyme library in the web server through text search matching or semantic analysis.
  • the lyric library can contain the lyrics of various nursery rhymes, and the DM module judges whether the content of the reply text exists in these nursery-rhyme lyrics; if it does, the current dialogue scene is set to the "children's song humming" voice scene.
  • the terminal device may perform beat alignment on the content of the reply text to facilitate subsequent speech synthesis. Specifically, in a specific embodiment, the terminal may align the content of the reply text with the determined beat through the PM module, so as to ensure that each field of the text is fused with the change rule of the rhythm of the nursery rhyme. Specifically, the terminal aligns the cut text field with the time axis according to the change rule of the beat.
  • the 3 words can be aligned with the 3 beats in a measure.
  • if the number of words in a field of the reply text is less than the number of beats in the measure, for example the field has 2 words and the time signature is 4/4, then adjacent text fields before and after the field are searched. If the field before (or after) this field also has 2 words, this field can be merged with it so that together they align with the 4 beats of the measure. If the fields before and after cannot be merged, or the number of words after merging is still less than the number of beats, the beats can be further aligned in the following ways: one way is to fill the part with fewer words than beats with rests (blanks); another way is to align the rhythm by lengthening the sound length of a particular word; another way is to lengthen the sound length of each word evenly to ensure overall time alignment.
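A minimal sketch of the rest-filling strategy (padding positions with rests where a field has fewer words than beats); the lyric fields and the 4/4 assumption are illustrative only.

```python
def align_to_beats(fields, beats_per_measure=4):
    """Map each text field onto measures; pad short measures with rests."""
    measures = []
    for field in fields:
        chars = list(field)
        for start in range(0, len(chars), beats_per_measure):
            chunk = chars[start:start + beats_per_measure]
            # Fill positions with fewer words than beats with rests (blanks).
            chunk += ["(rest)"] * (beats_per_measure - len(chunk))
            measures.append(chunk)
    return measures

for measure in align_to_beats(["一闪一闪", "亮晶晶", "满天都是小星星"]):
    print(measure)
```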
  • an embodiment of the present invention provides a speech synthesis device.
  • the device includes a processor and a memory coupled to the processor, where:
  • the memory is used to store an acoustic model library and a speech synthesis parameter database (may be referred to as a TTS parameter library).
  • the acoustic model library stores one or more acoustic models
  • the speech synthesis parameter database stores speech synthesis information associated with the identity of the user.
  • the processor is configured to: determine the identity of the user according to the current input voice of the user; obtain an acoustic model from the acoustic model library according to the current input voice, the preset information of the acoustic model including two or more of a preset sound speed, a preset volume, a preset pitch, a preset tone color, a preset intonation, and a preset prosody rhythm; determine basic speech synthesis information from the speech synthesis parameter database according to the identity of the user, the basic speech synthesis information including a change amount of one or more of the preset sound speed, the preset volume, and the preset pitch; determine a reply text according to the current input voice; determine enhanced speech synthesis information from the speech synthesis parameter database according to the reply text, or according to the reply text and the context information of the current input voice, the enhanced speech synthesis information including a change amount of one or more of the preset tone color, the preset intonation, and the preset prosody rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  • in some embodiments, the processor is specifically configured to: determine a literary style feature of the reply text according to the reply text, where the literary style feature includes one or more of the number of sentences, the number of words per sentence, and the arrangement order of the number of words per sentence in part or all of the content of the reply text; and select the corresponding change amount of the preset prosody rhythm from the speech synthesis parameter database according to the literary style feature involved in the reply text.
  • the change amount of the preset prosody rhythm represents changes in the reading duration, reading pause position, reading pause time, and stress of characters in part or all of the content of the reply text.
  • the preset information of the selected acoustic model further includes a language style feature
  • the language style feature specifically includes one or more of a pet phrase, a response mode for a specific scene, a wisdom type, a personality type, mixing of popular expressions or dialects, and forms of address for specific characters.
  • in some embodiments, the processor is specifically configured to: determine the preferences of the user according to the identity of the user; and select an acoustic model from the acoustic model library according to the preferences of the user.
  • in some embodiments, each acoustic model has an acoustic mode identifier; the processor is specifically configured to: determine, according to the content of the current input voice, an acoustic mode identifier related to that content; and select an acoustic model corresponding to the acoustic mode identifier from the acoustic model library.
  • in some embodiments, the processor is specifically configured to: select multiple acoustic models from the acoustic model library according to the identity of the user; determine a weight value for each of the multiple acoustic models, where the weight value of each acoustic model is preset by the user or is determined in advance according to the preferences of the user; and fuse the respective acoustic models based on the weight values to obtain a fused acoustic model.
  • in some embodiments, the processor is further configured to: before determining the identity of the user based on the user's current input voice, determine the correspondence between a target character and the user's preferred pronunciation based on the user's historical input voice, associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user, and save the correspondence to the speech synthesis parameter database;
  • the processor is further specifically configured to: when a target character associated with the identity of the user exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  • in some embodiments, the speech synthesis parameter database further stores a music library; the processor is further configured to select a background sound effect from the music library according to the reply text, the background sound effect being a piece of music or a sound special effect; and the processor is further specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  • in some embodiments, the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, disgust; the emotional intensity identifier is used to indicate the respective intensity value of the at least one emotion; the processor is specifically configured to: split the content of the reply text into a plurality of sub-contents and determine the emotional polarity type and emotional intensity of each sub-content respectively; and select the best-matching background sound effect from the music library according to the emotional polarity types and emotional intensities of the sub-contents;
  • the best-matching background sound effect includes a plurality of sub-segments, each of which has an emotional polarity type identifier and an emotional intensity identifier; the emotional polarity type indicated by the identifier of each sub-segment corresponds to the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  • the device may further include an audio circuit.
  • the audio circuit can provide an audio interface between the device and the user, and the audio circuit can further be connected with a speaker and a microphone.
  • the microphone collects the user's voice signals and converts the collected voice signals into electrical signals, which are received by the audio circuit and converted into audio data (that is, forming the user's input voice); the audio data are then transmitted to the processor for voice processing. On the other hand, the processor synthesizes the reply speech based on the user's input speech and transmits it to the audio circuit.
  • the audio circuit can convert the received audio data (that is, the reply speech) to an electrical signal, and then transmit it to the speaker.
  • the speaker converts the electrical signal into a sound signal and outputs it.
  • an embodiment of the present invention provides a speech synthesis device, which is characterized in that the speech synthesis device includes a speech recognition module, a speech dialogue module, and a speech synthesis module, wherein:
  • a voice recognition module for receiving a user's current input voice
  • a voice dialogue module configured to: determine the identity of the user based on the user's current input voice; determine basic speech synthesis information based on the identity of the user, the basic speech synthesis information including a change amount of one or more of a preset sound speed, a preset volume, and a preset pitch of an acoustic model; determine a reply text based on the current input voice; and determine enhanced speech synthesis information based on the reply text and context information, the enhanced speech synthesis information including a change amount of one or more of a preset tone color, a preset intonation, and a preset prosody rhythm of the acoustic model;
  • a speech synthesis module configured to obtain the acoustic model from a preset acoustic model library according to the current input voice, the preset information of the acoustic model including the preset sound speed, the preset volume, the preset pitch, the preset tone color, the preset intonation, and the preset prosody rhythm, and to perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  • the speech recognition module, speech dialogue module, and speech synthesis module are specifically configured to implement the speech synthesis method described in the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
  • an embodiment of the present invention provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method described in the first aspect above.
  • the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices in different styles, provide personalized speech synthesis effects to different users, greatly improve the voice interaction experience between the user and the terminal, and improve the timeliness of human-machine dialogue.
  • the terminal also allows the user to tune the terminal's voice response system in real time by voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • FIG. 1 is a schematic diagram of basic physical elements of speech according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of still another system architecture according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a system architecture and a terminal device according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a TTS parameter database provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an acoustic model library provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of speech synthesis of a reply text provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of still another system architecture and a terminal device according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention.
  • FIG. 11 is an exemplary diagram of basic TTS parameters associated with a user identity according to an embodiment of the present invention.
  • FIG. 12 is an exemplary diagram of a customized character pronunciation table provided by an embodiment of the present invention.
  • FIG. 13 is an exemplary diagram of an emotional parameter correction mapping table according to an embodiment of the present invention.
  • FIG. 14 is an exemplary diagram of a speech emotion parameter associated with a user identity according to an embodiment of the present invention.
  • FIG. 15 is an exemplary diagram of a scene parameter modification mapping table provided by an embodiment of the present invention.
  • FIG. 16 is an exemplary diagram of a voice scene parameter associated with a user identity according to an embodiment of the present invention.
  • FIGS. 17-19 are exemplary diagrams of calling instructions corresponding to a reply text provided by an embodiment of the present invention;
  • FIG. 20 is a schematic flowchart of a method for updating a customized character pronunciation table according to an embodiment of the present invention;
  • FIG. 21 is a schematic flowchart of a method for determining the TTS parameters required for a current reply text according to an embodiment of the present invention;
  • FIG. 22 is a schematic flowchart of a speech scene-related speech synthesis method of "poem recitation" provided by an embodiment of the present invention
  • FIG. 23 is a schematic diagram of aligning a rhythmic template with content of a reply text according to an embodiment of the present invention.
  • FIG. 24 is a schematic flowchart of a speech scene-related speech synthesis method for a “song humming” according to an embodiment of the present invention.
  • FIG. 25 is a schematic diagram of performing beat alignment on content of a reply text according to an embodiment of the present invention.
  • FIG. 26 is a schematic flowchart of a scene-related speech synthesis method for “character imitation” according to an embodiment of the present invention.
  • FIG. 27 is an exemplary diagram of sound characteristics corresponding to sound characteristics of some specific acoustic models according to an embodiment of the present invention.
  • FIG. 28 is a schematic diagram of an interface for selecting a parameter of a speech feature and a parameter of a language style feature according to an embodiment of the present invention
  • FIG. 29 is a schematic flowchart of a speech synthesis method for a scene with a superimposed background sound effect according to an embodiment of the present invention;
  • FIG. 30 is a schematic diagram of determining a most matching music segment according to an embodiment of the present invention.
  • FIG. 31 is a schematic structural diagram of a hardware device according to an embodiment of the present invention.
  • Speech, which is the sound of language, is the sound-wave form of language as a communication tool; speech realizes language's expressive and social functions.
  • the basic physical elements of speech include sound intensity, sound length, pitch, and tone color. See Figure 1, which are described as follows:
  • the sound intensity may be called volume, tone, stress, and so on.
  • the sound intensity is determined by the amplitude of the sound wave, which is directly proportional to the amplitude of the sound wave, indicating the strength of the sound. Sound intensity has the function of distinguishing the meaning of words and certain grammatical functions in Chinese. For example, sound intensity determines the meaning of soft sound and stress.
  • the sound length indicates the duration of the sound-wave vibration; it is determined by the duration of the vibration of the sounding body: the longer the vibration time, the longer the sound length.
  • sound length can also be expressed through the concept of speech speed: the speed of sound indicates how fast the sound is, so the longer the sound length, the slower the sound speed.
  • Pitch is sometimes also called tone height.
  • the pitch is determined by the frequency of the vibration of the sound wave. The higher the frequency, the higher the pitch.
  • the tone of Chinese characters and the intonation of sentences are mainly determined by the pitch.
  • Timbre may be called sound quality, voice quality, etc.
  • the tone color (timbre) represents the characteristics and nature of the sound; different timbres correspond to different waveform shapes of the sound wave.
  • the timbre is the basic characteristic of a sound that is different from other sounds, and the timbre of different people (or pronunciation bodies) is different.
  • Chinese differs from Western language families in its grammatical structure, grammatical rules, acoustic characteristics, and prosodic structure.
  • Chinese characters are one character and one sound, that is, a syllable is generally a Chinese character.
  • Tones are an integral part of the syllable structure and are usually used to indicate the rise and fall of a syllable's pitch.
  • the formation of tones is mainly determined by changes in pitch, and also involves changes in sound length. During pronunciation, the sounding body can adjust the changes in pitch and length at any time, so that different tones are formed. Tones play an important role in distinguishing meanings.
  • for example, in Chinese speech, tones are used to distinguish the meanings of word pairs such as "theme" and "genre", or "exercise" and "connection".
  • each character has a corresponding fundamental frequency (the frequency of the fundamental tone, which determines the pitch of the character's basic sound), and the fundamental frequencies of adjacent characters may also affect each other, producing fundamental frequency variation (i.e., the phenomenon of sound change).
  • in addition, there are pauses in the pronunciation of consecutive sentences, and different words in a sentence are read lightly or stressed according to the surrounding semantics.
  • the system architecture of the embodiment of the present invention relates to a user and a terminal.
  • the user inputs a voice to the terminal, and the terminal can process the user's voice through a voice response system to obtain a reply voice for the user and present the reply voice to the user.
  • the terminal in the embodiment of the present invention may be a dialogue interactive robot, a home/commercial robot, a smart speaker, a smart table lamp, a smart home appliance, smart furniture, a smart vehicle, or voice assistant / voice dialogue software running on a device such as a mobile phone, a notebook computer, or a tablet computer.
  • the terminal is a robot
  • the user sends a voice to the robot (for example, the user speaks directly to the robot), and the robot replies to the user with a voice (for example, the robot plays the reply voice through a speaker), thereby realizing a human-machine dialogue between the user and the robot.
  • in another example, the terminal is a voice assistant running on a smartphone; the user sends a voice to the voice assistant (for example, the user triggers the voice-assistant icon displayed on the smartphone and then speaks), and the voice assistant presents a reply to the user (for example, by displaying the reply message on the screen and playing the reply voice through the speaker), thereby realizing an interactive dialogue between the user and the voice assistant.
  • the terminal may also be a server.
  • the user sends a voice to a smart phone,
  • the smart phone transmits the voice information to the server,
  • the server obtains a reply voice based on the voice information and returns the reply voice to the smart phone, and
  • the smart phone presents the reply voice to the user (for example, by displaying the reply message on the screen and playing the reply voice through the speaker), thereby realizing an interactive dialogue between the user and the server.
  • FIG. 4 shows a voice response system 10 of a terminal in a system architecture.
  • the voice response system 10 includes a voice recognition module 101, a voice dialog module 102, and a voice synthesis module 103.
  • the functions of each module are described as follows:
  • Speech recognition (Automatic Speech Recognition, ASR) module 101: the ASR module 101 is used to recognize the content of the user's input speech and convert it into text, realizing the conversion from "speech" to "text".
  • Voice dialogue module 102: can be used to generate a reply text based on the recognized text input by the ASR module 101 and transmit the reply text to the speech synthesis module 103; the voice dialogue module 102 is also used to determine the personalized TTS parameters corresponding to the reply text, so that the speech synthesis module 103 can subsequently perform speech synthesis on the reply text based on the relevant TTS parameters.
  • the voice dialog module 102 may specifically include the following modules:
  • Natural Language Understanding (NLU) module 1021: the NLU module 1021 can be used to perform grammatical analysis and semantic analysis on the recognized text input by the ASR module 101, so as to understand the content of the user's speech.
  • Natural Language Generation (NLG) module 1022: the NLG module 1022 can be used to generate a corresponding reply text based on the content of the user's speech and the context information.
  • a dialog management (Dialogue Management, DM) module 1023 is used to track the current session state and control the dialog strategy.
  • User management (UM) module 1024: responsible for user identity confirmation, user information management, and so on. The UM module 1024 can use existing identity recognition techniques (such as voiceprint recognition, face recognition, or even multi-modal biometric recognition) to determine the user's identity.
  • the intent recognition module 1025 can be used to identify the user intention indicated by the user's speaking content.
  • Corpus knowledge related to TTS parameter setting may be added to the intent recognition module 1025, so that the intent recognition module 1025 can identify a user's interaction intention to set (update) one or more TTS parameters.
  • TTS parameter database 1026: used to store basic TTS parameters (or basic speech synthesis information), enhanced TTS parameters (or enhanced speech synthesis information), custom character pronunciation tables, a music library, and so on. This information is described as follows:
  • The basic TTS parameters represent changes in one or more of the preset sound speed, preset volume, and preset pitch of the acoustic model used in synthesizing speech.
  • The basic TTS parameters are associated with a user's identity; that is to say, different basic TTS parameters can be organized according to the identity of the user (or according to the user's preferences).
  • The enhanced TTS parameters represent changes in one or more of the preset timbre, preset intonation, and preset prosodic rhythm of the acoustic model used in synthesizing speech.
  • The enhanced TTS parameters can be further classified into speech emotion parameters and speech scene parameters. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics; according to the emotional characteristics, they can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness. For the specific implementation, refer to the detailed description below.
  • The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics. The speech scene parameters can be further divided into parameters for daily conversation, poetry recitation, song humming, storytelling, news broadcasting, and so on; that is, using these speech scene parameters in speech synthesis makes the synthesized speech exhibit the sound effects of the corresponding voice scene. For the specific implementation, refer to the detailed description below.
  • the customized character pronunciation table includes a mapping relationship between a target character and a user's preferred pronunciation.
  • the target character may be a character (Chinese character or other character), a letter, a number, a symbol, or the like.
  • the mapping relationship between the target character and the user's preferred pronunciation is used to enable the target character involved in the speech synthesized by the acoustic model to have the user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is related to the identity of the user, that is, different mapping relationships can be organized according to the identity of the user. For specific implementation, please refer to the detailed description later.
  • the music library includes a plurality of music information, and the music information is used to provide a background sound effect in a speech synthesis process.
  • the background sound effect may be specific music or a sound special effect.
  • the background sound effect is used to superimpose different styles and rhythms of music or sound effects on the speech background synthesized by the acoustic model, thereby enhancing the expression effect of the synthesized speech (such as enhancing the emotional effect).
  • For the specific implementation, refer to the detailed description below.
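  • As a rough illustration only (the text does not prescribe any particular data layout, so all class and field names below are hypothetical), the TTS parameter database 1026 can be sketched as a per-user store of basic TTS parameters and a custom character pronunciation table, plus shared enhanced TTS parameters and a music library:
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class BasicTTSParams:
    # Changes relative to the selected acoustic model's presets, e.g. -0.4 means -40%.
    speed_delta: float = 0.0
    volume_delta: float = 0.0
    pitch_delta: float = 0.0

@dataclass
class UserTTSProfile:
    basic: BasicTTSParams = field(default_factory=BasicTTSParams)
    # Custom character pronunciation table: target character/string -> preferred pronunciation.
    pronunciation_table: Dict[str, str] = field(default_factory=dict)
    default_emotion: str = "Neutral"
    default_scene: str = "daily_conversation"

class TTSParameterDatabase:
    """Toy stand-in for the TTS parameter database 1026 (not the actual implementation)."""
    def __init__(self) -> None:
        self.users: Dict[str, UserTTSProfile] = {}
        # Enhanced TTS parameters shared by all users (values are placeholders).
        self.emotion_params = {"Neutral": {}, "Happy_low": {"pitch_delta": 0.05}}
        self.scene_params = {"daily_conversation": {}, "poem_recitation": {"pause_scale": 1.3}}
        # Music library: background sound effect name -> audio resource.
        self.music_library = {"sad_piano": "bgm/sad_piano.wav"}

    def profile(self, user_id: str) -> UserTTSProfile:
        # Unregistered users fall back to a default profile (all deltas zero, empty table).
        return self.users.setdefault(user_id, UserTTSProfile())

db = TTSParameterDatabase()
db.profile("xiaoming").basic.volume_delta = 0.20   # e.g. after "turn the volume up a bit"
```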
  • Parameter management (PM) module 1027: used to manage the TTS parameters in the TTS parameter database; the management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more TTS parameters according to the user's intention to set the TTS parameters.
  • the PM module 1027 may be used to determine a basic TTS parameter associated with the user according to the identity of the user, and to determine an enhanced TTS parameter used to enhance the speech synthesis effect according to the content and context information of the reply text.
  • the TTS module 103 is used to convert the reply text generated by the voice dialog module 102 into a reply voice, so as to present the reply voice to the user.
  • the TTS module 103 may specifically include the following modules:
  • Instruction generation module 1031: may be configured to generate or update a calling instruction based on the reply text and the TTS parameters (including basic TTS parameters and enhanced TTS parameters) transmitted from the voice dialogue module 102; the calling instruction is then applied to the TTS engine 1032.
  • TTS engine 1032: used to call an appropriate acoustic model from the acoustic model library 1033 according to the calling instruction generated or updated by the instruction generation module 1031, and to use that acoustic model, together with the basic TTS parameters, the enhanced TTS parameters, the mapping relationship between target characters and the user's preferred pronunciations, background sound effects, and other information, to synthesize the reply text into a reply speech and return the reply speech to the user.
  • the acoustic model library 1033 may include multiple acoustic models, such as a general acoustic model, and several personalized acoustic models, and so on. These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information, that is, each acoustic model is bound to a specific preset information. These preset information can be used as the basic input information of the acoustic model.
  • The preset information of the general acoustic model may include two or more of the model's preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model likewise includes two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, and may additionally include other personalized information such as catchphrases, ways of responding in specific scenarios, type of wisdom, personality type, mixed-in popular expressions or dialects, forms of address for specific persons, and other language-style characteristics. It should be understood that the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information differ between acoustic models; in particular, the preset information of a personalized acoustic model may differ significantly from that of the general acoustic model.
  • the acoustic model can convert the reply text into a reply voice according to the preset information and the change information of the preset information.
  • the change information of the preset information referred to here means information such as a basic TTS parameter, an enhanced TTS parameter, a mapping relationship between a target character and a user's preferred pronunciation, and a background sound effect selected in speech synthesis.
  • The speech synthesized through the general acoustic model presents the sound effect of a normal, generic dialogue scenario, while the speech synthesized through a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario; the implementation of the "character imitation" dialogue scenario will be described in detail later.
  • each module in the embodiment shown in FIG. 4 may be a software module, and these software modules may be stored in the memory of the terminal device, and the processor in the terminal device calls these modules in the memory to Perform a speech synthesis method.
  • the implementation form of each module in the embodiment in FIG. 4 may be a hardware component in a terminal device.
  • the voice response system obtains the user's input voice
  • the response text is obtained through the voice recognition module and the voice dialogue module.
  • The voice dialogue module determines, based on the current user identity, the basic TTS parameters associated with that identity from the TTS parameter database; determines, based on the reply text and the context information, the enhanced TTS parameters and the background sound effect from the TTS parameter database; and, if the reply text contains a target character associated with the user identity, also determines the user's preferred pronunciation corresponding to that target character.
  • The speech synthesis module calls an appropriate acoustic model from the acoustic model library based on the user's input speech, the user's preference (which is associated with the user's identity), or the reply text, and combines one or more of the TTS parameters (basic TTS parameters, enhanced TTS parameters, the mapping relationship between target characters and the user's preferred pronunciations, and the background sound effect) to perform speech synthesis, generating a reply speech to be presented to the user.
  • FIG. 8 shows a speech synthesis process in an application scenario.
  • For example, the voice response system obtains the user's input voice, and the reply text obtained through the voice recognition module and the voice dialogue module is "the weather is very good today".
  • The voice dialogue module determines the basic TTS parameters associated with the user's identity, determines enhanced TTS parameters such as the speech emotion parameters and speech scene parameters based on the content and context information of the reply text, and determines the background sound effect based on the content of the reply text.
  • On this basis, the speech synthesis module can use the selected acoustic model to synthesize the reply text based on the selected basic TTS parameters, speech emotion parameters, speech scene parameters, and background sound effect, finally generating a synthesized speech (jin1, tian1, tian1, qi4, hen3, hao3) for replying to the user.
  • It should be noted that FIG. 4 shows only one specific implementation of the present invention. Other possible implementations may include more or fewer functional modules, and the functional modules described above may be appropriately split, combined, or deployed differently, and so on.
  • the acoustic model library 1033 can be deployed in the TTS engine 1032 to make it easier for the TTS engine to call the acoustic model and perform speech synthesis through the acoustic model.
  • the acoustic model library 1033 may also be deployed in the voice dialogue module 102, or deployed outside the voice dialogue module 102.
  • the PM module 1027 and the TTS parameter database 1026 may also be integrated together and independently deployed at a location outside the voice dialog module 102.
  • the PM module 1027 may also be specifically deployed in the TTS engine 1032, that is, "TTS parameter management" may be implemented as a function of the TTS engine 1032.
  • the intent recognition module 1025 may also be specifically deployed in the DM module 1023, that is, “intent recognition” may be implemented as a function of the DM module 1023.
  • In still other embodiments, the TTS parameter database 1026 may be specifically deployed in the PM module 1027, that is, the PM module 1027 may organize and store the TTS parameters by category and user identity; or the TTS parameter database 1026 may be independently deployed at a location outside the voice dialogue module 102; or the acoustic model library 1033 may be independently deployed at a location outside the TTS module 103; or the acoustic model library 1033 may be deployed together with the TTS parameter database 1026, and so on.
  • the PM module 1027 may be split into a basic TTS parameter management module 1028 and an enhanced TTS parameter management module 1029.
  • the basic TTS parameter management module 1028 is used to manage the basic TTS parameters and customized character pronunciation tables in the TTS parameter database 1026.
  • The management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more basic TTS parameters according to the user's intention to set the basic TTS parameters, and performing operations such as query, add, delete, update (change), select, and obtain (confirm) on the custom character pronunciation table according to the user's intention to set the preferred pronunciation corresponding to a target character.
  • the basic TTS parameter management module 1028 can also be used to obtain the basic TTS parameters associated with the user identity, the user's preferred pronunciation corresponding to the target character, and so on.
  • the enhanced TTS parameter management module 1029 is used to manage the enhanced TTS parameters and music library in the TTS parameter database 1026.
  • The management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more enhanced TTS parameters according to the user's intention to set the enhanced TTS parameters, and performing the same operations on the music library according to the user's intention to set the background sound effect.
  • the enhanced TTS parameter management module 1029 can obtain the enhanced TTS parameters and background sound effects used to enhance the speech synthesis effect according to the content and context information of the reply text.
  • each module in the foregoing embodiment in FIG. 9 may be a software module, and these software modules may be stored in the memory of the terminal device, and the processor in the terminal device calls these modules in the memory to Perform a speech synthesis method.
  • the implementation form of each module in the foregoing embodiment in FIG. 9 may be a hardware component in a terminal device.
  • the enhanced TTS parameter management module 1029 may also be deployed in the TTS engine 1032, that is, "enhanced TTS parameter management" may be implemented as a function of the TTS engine 1032.
  • Through the embodiments described above, the voice dialogue module can, on the one hand, generate the corresponding reply text and, on the other hand, select personalized TTS parameters based on the reply text and the dialogue context information, combined with the current user's identity and preferences; the TTS module can then generate a reply speech with a specific style based on these personalized TTS parameters, providing the user with a personalized speech synthesis effect. This greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of the human-machine dialogue.
  • the terminal also allows the user to tune the terminal in real time through voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • the method process includes but is not limited to the following steps:
  • Step 101 The user inputs a voice to the terminal, and accordingly, the terminal obtains the voice input by the user.
  • The terminal in the embodiment of the present invention may be a dialogue interactive robot, a home / commercial robot, a smart speaker, a smart table lamp, a smart home appliance, smart furniture, a smart vehicle, or voice assistant / voice conversation software running on a device such as a mobile phone, notebook computer, or tablet computer.
  • Step 102 The terminal recognizes the content of the voice input by the user, and recognizes the voice as text.
  • Specifically, the terminal can recognize the content of the user's input voice through the ASR module of its voice response system; for example, the content of the user's input voice may be recognized as "You speak too slowly, please speak faster", "Can the volume be turned up", "What is the line before 'someone in the depths of the white clouds'", and so on.
  • the ASR module can be directly implemented by using a current commercial ASR system. Those skilled in the art are familiar with the implementation manner, and will not be described here.
  • Step 103 The terminal determines the identity of the user.
  • the terminal may recognize the identity of the user through the UM module of its voice response system.
  • Specifically, the UM module may determine the person who input the voice (i.e., the user) through voiceprint recognition, face recognition, or even multi-modal biometric recognition. Understandably, if the terminal recognizes the user as a locally registered user (for example, the current user is xiaoming), the TTS parameters corresponding to that user can subsequently be adjusted; if the terminal cannot identify the user, the user is determined to be a stranger (for example, the current user is xiaohua), and the default TTS parameters can subsequently be adjusted.
  • Step 104 The terminal determines the user's speaking intention.
  • the terminal may determine the user's intention to speak in combination with the NLU module and the intent recognition module of its voice response system.
  • the implementation process includes the following:
  • The NLU module performs text analysis on the recognized text, including word segmentation, semantic analysis, and part-of-speech analysis, to extract keywords/words related to TTS parameter setting.
  • For example, keywords/words related to TTS parameter setting may include "sound", "volume", "speaking speed", "pronunciation", "emotion", "recitation", "fast", "slow", "happy", "sad", and so on.
  • the intent recognition module combines the context of the dialogue to perform reference resolution and sentence completion, and then can use template matching or statistical model to identify whether the user has the intention to update TTS parameters.
  • Reference resolution refers to identifying which noun phrase a pronoun in the text refers to.
  • The template matching method first analyzes the combinations of keywords and words appearing in common instructions, and then constructs templates/rules to match specific intents. For example, if a text sentence matches "... sound / speak / talk / read ... slow(er) / fast(er) ...", the user's speaking intention can be taken to be adjusting the sound speed in the basic TTS parameters corresponding to the user (for example, increasing or decreasing the sound speed by 20%); if a text sentence matches "... sound / speak / talk / read ... loud / quiet / big / small ...", the user's speaking intention can be taken to be adjusting the volume in the basic TTS parameters corresponding to the user (for example, increasing or decreasing the volume by 20%); if a sentence matches a template such as "[word 1] should be pronounced / read ...", the user's speaking intention can be taken to be correcting the pronunciation of a target character.
  • Alternatively, a statistical model can be trained to classify intents. Training algorithms include, but are not limited to, Support Vector Machine (SVM) algorithms, Naive Bayes algorithms, decision tree algorithms, neural network (NN) algorithms, and so on. In this way, after the model is trained, when the user's speaking intent needs to be determined, the keywords/words corresponding to the user's spoken text sentence are input to the model to determine the speaking intent corresponding to the text sentence.
  • In a possible embodiment, trained models can be classified in advance by dialogue domain or topic type, for example into "weather", "poetry", "song", "news", "daily life", "movie", "sports", and so on.
  • The intent recognition module can determine the dialogue domain or topic type based on the current conversation state and the keywords/words of the text sentence, and then feed the keywords/words into the corresponding dialogue-domain model or topic-type model to determine the speaking intent of the text sentence.
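  • As a minimal sketch of the template/keyword matching idea described above (the actual templates, keywords, and adjustment amplitudes used by the intent recognition module 1025 are not specified here, so all patterns and intent names below are assumptions):
```python
import re

# Hypothetical templates mapping keyword patterns to TTS-parameter-setting intents.
INTENT_TEMPLATES = [
    (re.compile(r"too slow|speak faster|read faster"), "increase_sound_speed"),
    (re.compile(r"too fast|speak slower|read slower"), "decrease_sound_speed"),
    (re.compile(r"louder|volume up|too quiet|can't hear"), "increase_volume"),
    (re.compile(r"quieter|volume down|too loud"), "decrease_volume"),
]

def match_tts_setting_intent(utterance: str):
    """Return the first matching TTS-setting intent, or None for an ordinary dialogue turn."""
    text = utterance.lower()
    for pattern, intent in INTENT_TEMPLATES:
        if pattern.search(text):
            return intent
    return None

print(match_tts_setting_intent("You speak too slowly, please speak faster"))  # increase_sound_speed
```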
  • Step 105 The terminal determines whether the user's speaking intention is to set the TTS parameters.
  • Step 106 If it is determined that the speaking intention is to set the TTS parameters (such as update, delete, and add operations), the terminal executes the setting operation of the TTS parameters according to the instruction of the speaking intention.
  • The TTS parameters include basic TTS parameters associated with the identity of the user, such as changes to the sound speed, volume, and pitch, as well as the custom character pronunciation table; the TTS parameters also include enhanced TTS parameters such as speech emotion parameters and speech scene parameters, as well as background sound effects. It should be understood that, in a possible implementation, the enhanced TTS parameters may or may not be associated with the identity of the user.
  • the setting operations are operations such as adding TTS parameters, deleting TTS parameters, and updating (changing) TTS parameters.
  • an update operation may be performed on a TTS parameter associated with the user identity. If the user is an unregistered user, a local user identity may be created / registered for the user. The local user identity is initially associated with the default TTS parameters, and then the default TTS parameters associated with the user identity are updated.
  • the terminal may use the PM module of the voice response system to update the TTS parameter associated with the user identity in the TTS parameter database according to the TTS parameter update instruction issued by the voice dialogue module (such as the NLU module and / or the intent recognition module). Perform the update operation.
  • the basic TTS parameter represents the amount of change (or change coefficient) relative to the physical elements of the basic speech.
  • The amounts of change in the preset sound speed, preset volume, and preset pitch can be organized and stored according to user identity; see FIG. 11, which shows an exemplary chart of basic TTS parameters associated with user identities. As shown in FIG. 11, each array in the chart represents the rising/falling ratios applied to the preset sound speed, preset volume, and preset pitch of the selected acoustic model.
  • the chart includes unregistered users and registered users.
  • An unregistered user is a user who has not yet performed identity registration or has failed authentication; the associated changes to the preset sound speed, preset volume, and preset pitch are all the default value of 0. Registered users are users who have performed identity registration and passed authentication, for example "xiaoming", "xiaoming_mom", "xiaoming_grandma", "xiaoming_dad", and so on.
  • For example, for a registered user whose associated basic TTS parameters for sound speed, volume, and pitch are "-40%, +40%, +20%", when replying to that user, the basic speech corresponding to the reply text will have its sound speed reduced by 40%, its volume increased by 40%, and its pitch raised by 20%.
  • the registered users ’preset sound speed, preset volume, and preset pitch changes can be added, corrected / changed, and deleted.
  • For example, based on "xiaoming"'s speaking intention to "increase the volume", the terminal increases the change in the preset volume associated with "xiaoming" from the default value "0" to "+20%"; based on "xiaoming_mom"'s speaking intention to "reduce the sound speed", the terminal reduces the change in the preset sound speed associated with "xiaoming_mom" from the original "+40%" to "+20%", and so on.
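  • The chart of FIG. 11 can be pictured, purely for illustration, as the following per-user table of change coefficients (the values repeat the examples quoted above; the field names are assumptions):
```python
# Per-user basic TTS parameters: rising/falling ratios applied to the acoustic model's presets.
basic_tts_params = {
    "unregistered":     {"sound_speed": 0.00,  "volume": 0.00, "pitch": 0.00},
    "xiaoming":         {"sound_speed": 0.00,  "volume": 0.20, "pitch": 0.00},  # after "increase the volume"
    "xiaoming_grandma": {"sound_speed": -0.40, "volume": 0.40, "pitch": 0.20},
    "xiaoming_mom":     {"sound_speed": 0.20,  "volume": 0.00, "pitch": 0.00},  # lowered from the original +40%
}

def apply_basic_params(preset_speed: float, preset_volume: float, preset_pitch: float, user_id: str):
    """Apply a user's change coefficients to the acoustic model's preset values."""
    p = basic_tts_params.get(user_id, basic_tts_params["unregistered"])
    return (preset_speed * (1 + p["sound_speed"]),
            preset_volume * (1 + p["volume"]),
            preset_pitch * (1 + p["pitch"]))

# Speed down 40%, volume up 40%, pitch up 20% relative to the presets.
print(apply_basic_params(1.0, 1.0, 1.0, "xiaoming_grandma"))
```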
  • the customized character pronunciation table may be organized and stored according to user identity.
  • FIG. 12 shows an exemplary diagram of a customized character pronunciation table associated with a user identity.
  • the custom character pronunciation table corresponding to an unregistered user is empty, and the custom character pronunciation table corresponding to a registered user can be added, changed, or deleted based on the user's preference.
  • The object of the setting operation may be a character, a person/place name, a letter, or a special symbol that the terminal easily mispronounces or for which the user has a preference.
  • the customized character pronunciation table includes the mapping relationship between the target character (string) and the user's preferred pronunciation.
  • The target character can be a word (a Chinese character or a foreign-language word), a phrase, a sentence, a number, or a symbol (such as Chinese characters, foreign characters, emoji, punctuation marks, special symbols, and so on).
  • For example, in the terminal's original pronunciation table, "Piggy Page" is pronounced "xiao3, zhu1, pei4, qi2". If "xiaoming"'s speaking intention is to change this, the pronunciation of the character "qi" in the phrase "Piggy Page" can be set to "ki1".
  • Then the terminal writes "Piggy Page" and "xiao3 zhu1 pei4 ki1" as a mapping pair into the custom character pronunciation table associated with "xiaoming". It can be understood that the chart shown in FIG. 12 is merely an example and not a limitation.
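  • A minimal sketch of how such a per-user custom character pronunciation table might be applied when preparing a reply text for synthesis (the data layout, and the Chinese string "小猪佩奇" assumed to underlie "Piggy Page", are illustrative only):
```python
# Per-user custom character pronunciation tables in the style of FIG. 12.
pronunciation_tables = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},   # user-preferred pronunciation
}

def apply_preferred_pronunciations(reply_text: str, default_pinyin: dict, user_id: str) -> dict:
    """Override the default pronunciation of any target string the user has customized."""
    pinyin = dict(default_pinyin)                       # target string -> pronunciation
    for target, preferred in pronunciation_tables.get(user_id, {}).items():
        if target in reply_text:
            pinyin[target] = preferred
    return pinyin

print(apply_preferred_pronunciations(
    "我们一起看小猪佩奇", {"小猪佩奇": "xiao3 zhu1 pei4 qi2"}, "xiaoming"))
```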
  • the speech emotion parameter represents the change of intonation in the voice.
  • Intonation change refers to the rise and fall of pitch within the voice, as well as changes in the emphasis of volume, the sound speed, and the pause/dwell time of the speech; these changes have a very important effect on the expressiveness of the voice. Through intonation, the voice can present complex emotions such as joy, delight, sadness, grief, hesitation, relaxation, firmness, and heroism.
  • the TTS parameter database maintains a mapping relationship between “speech emotion suggested by the voice dialog module” and “speech emotion parameter”.
  • the mapping relationship is, for example, the emotion parameter correction mapping table shown in FIG. 13.
  • The speech synthesized based on different speech emotion parameters will carry the corresponding emotional tone.
  • For example, if the speech emotion suggested by the voice dialogue module is "Neutral", the speech synthesis module synthesizes speech based on the neutral-emotion speech emotion parameters, and the resulting voice reflects a neutral tone (that is, without any particular emotional characteristics); if the suggested speech emotion is "Happy_low", the speech synthesis module synthesizes speech based on the mildly-happy speech emotion parameters, and the resulting voice carries a mildly happy tone; if the suggested speech emotion is "Sad_low", the voice synthesized based on the mildly-sad speech emotion parameters carries a mildly sad tone, and so on.
  • the chart shown in FIG. 13 is only an example and not a limitation.
  • the speech emotion parameters are also related to the reply text and context information.
  • the default voice emotion parameters associated with the user identity can correspond to neutral emotions.
  • The terminal can comprehensively determine the speech emotion parameters used in the current speech synthesis based on the user identity, the reply text, and the context information.
  • For example, if the terminal determines that the reply text and context information do not specify a speech emotion, or the specified speech emotion is consistent with the user's default speech emotion, the terminal applies the user's default speech emotion to the final speech synthesis: if the user's default speech emotion is "neutral" and the terminal determines that the current reply text specifies no speech emotion, the terminal still applies "neutral" to the final synthesis. If the terminal determines that the reply text and context information require a specified speech emotion that is inconsistent with the user's default speech emotion, the terminal automatically adjusts the current speech emotion to the one it has determined: for example, the user's default speech emotion is "neutral", but the terminal determines that the speech synthesis of the current reply text needs a "mildly happy" emotion, so the terminal adopts the "mildly happy" speech emotion parameters for the final synthesis.
  • the terminal may update the voice emotion parameters associated with the identity of the user based on the user's speaking intention. As shown in FIG. 14, the terminal may change the voice emotion parameters associated with “xiaoming_grandma” according to the speaking intent of “xiaoming_grandma”, that is, change the voice emotion parameters of the “neutral emotion” to the voice emotion parameters of “lightly happy”. It can be understood that the chart shown in FIG. 14 is only an example and not a limitation.
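  • For illustration, the emotion parameter correction mapping of FIG. 13 and the default-versus-suggested selection logic described above might be sketched as follows (the numeric corrections are not given in the text and are pure placeholders):
```python
from typing import Optional

# Hypothetical emotion parameter correction mapping (labels follow the examples above).
EMOTION_PARAMS = {
    "Neutral":   {"pitch": 0.00,  "sound_speed": 0.00,  "volume": 0.00},
    "Happy_low": {"pitch": 0.05,  "sound_speed": 0.05,  "volume": 0.05},
    "Happy_mid": {"pitch": 0.10,  "sound_speed": 0.10,  "volume": 0.10},
    "Sad_low":   {"pitch": -0.05, "sound_speed": -0.10, "volume": -0.05},
}

def select_emotion(user_default: str, suggested: Optional[str]):
    """Use the emotion suggested for the reply text if any, otherwise the user's default."""
    label = suggested or user_default
    return label, EMOTION_PARAMS.get(label, EMOTION_PARAMS["Neutral"])

print(select_emotion("Neutral", "Happy_low"))   # terminal overrides the default with "Happy_low"
print(select_emotion("Neutral", None))          # no suggestion: keep the user's default emotion
```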
  • the speech scene parameters in the enhanced TTS parameters represent the change of the rhythm in the speech.
  • A so-called prosodic rhythm change gives the speech a clearer and stronger rhythm and emotional expression than the natural rhythm of ordinary dialogue, so that the voice dialogue fits a specific application scenario. The change in prosodic rhythm can be reflected in changes of pause position/pause duration, stress position, character/word sound length, character/word sound speed, and so on. Specific combinations of these prosodic changes can present voice scene effects such as "poetry recitation", "song humming (or nursery rhymes)", "storytelling", and "news broadcasting".
  • the TTS parameter database maintains a mapping relationship between “speech scenarios suggested by the voice dialogue module” and “speech scenario parameters”, and the mapping relationship is, for example, the scenario parameter modification mapping table shown in FIG. 15.
  • The speech synthesized based on different speech scene parameters will reflect the corresponding scene tone: speech synthesized based on the daily-conversation scene parameters reflects the tone of everyday conversation, speech synthesized based on the poetry-recitation scene parameters reflects the tone of poetry recitation, speech synthesized based on the song-humming scene parameters reflects the tone of song humming, and so on.
  • the chart shown in FIG. 15 is merely an example and not a limitation. In a possible embodiment, other voice scene parameters may also be designed based on the needs of actual applications, such as story interpretation, news broadcast, and the like.
  • the voice scene parameters are mainly related to the reply text and context information.
  • the voice scene corresponding to the default voice scene parameter associated with the user identity is “daily conversation”.
  • The terminal may comprehensively determine the speech scene parameters used in the current speech synthesis based on the user identity, the reply text, and the context information. For example, if the terminal determines that the reply text and context information do not specify a voice scene, or the specified voice scene is consistent with the user's default voice scene, the terminal applies the user's default voice scene parameters to the final speech synthesis; if the current reply text specifies no voice scene, the terminal still applies "daily conversation". If the terminal determines that the reply text and context information require a voice scene that is inconsistent with the user's default voice scene, the terminal automatically adjusts the current voice scene to the one it has determined: for example, the user's default voice scene is "daily conversation", but the terminal determines that the speech synthesis of the current reply text requires "poetry recitation", so the terminal applies the speech scene parameters corresponding to "poetry recitation" to the final synthesis.
  • the terminal may update the default voice scene parameters associated with the identity of the user based on the user's speaking intention. As shown in FIG. 16, the terminal may change the voice scene corresponding to the default voice scene parameter of “xiaoming_dad” from “daily conversation” to “poem recitation” according to the speaking intention of “xiaoming_dad”. It can be understood that the chart shown in FIG. 16 is merely an example and not a limitation.
  • the PM module performs a specific update operation.
  • The process can be implemented as follows: the PM module maintains a mapping table between parameter-update intents and specific operation interfaces, so that the corresponding operation API can be determined from the currently identified intent ID.
  • For example, for the intention of increasing the volume, the PM module calls the Update-Costomized-TTS-Parameters-volume interface, whose inputs are the user ID and the adjustment amplitude; for the intention of correcting the pronunciation of a character or symbol, it calls the Update-Costomized-TTS-Parameters-pron interface, whose inputs are the user ID, the symbol to be corrected, the target pronunciation string, and so on.
  • The PM module executes the relevant update interface and implements the TTS parameter update process described above. If the current user is an unregistered user, the PM module can add a user information record for the unknown user, with its associated TTS parameters set to the default values, and then update the associated TTS parameters.
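  • The "intent ID to operation API" mapping table might look roughly like the following sketch (the Python function names only mirror the Update-Costomized-TTS-Parameters-volume and Update-Costomized-TTS-Parameters-pron interfaces named above; the amplitudes and argument shapes are assumptions):
```python
def update_costomized_tts_parameters_volume(user_id: str, delta: float) -> None:
    # Placeholder for the volume-update interface: adjust the user's volume change coefficient.
    print(f"volume change for {user_id} adjusted by {delta:+.0%}")

def update_costomized_tts_parameters_pron(user_id: str, symbol: str, pinyin: str) -> None:
    # Placeholder for the pronunciation-correction interface: update the custom pronunciation table.
    print(f"pronunciation of '{symbol}' for {user_id} set to '{pinyin}'")

# Mapping table from recognized intent IDs to operation interfaces.
INTENT_TO_API = {
    "increase_volume": lambda user, args: update_costomized_tts_parameters_volume(user, +0.20),
    "decrease_volume": lambda user, args: update_costomized_tts_parameters_volume(user, -0.20),
    "correct_pronunciation": lambda user, args: update_costomized_tts_parameters_pron(user, *args),
}

def execute_tts_setting(intent_id: str, user_id: str, args: tuple = ()) -> None:
    handler = INTENT_TO_API.get(intent_id)
    if handler is None:
        raise ValueError(f"no update interface registered for intent '{intent_id}'")
    handler(user_id, args)

execute_tts_setting("increase_volume", "xiaoming")
execute_tts_setting("correct_pronunciation", "xiaoming", ("小猪佩奇", "xiao3 zhu1 pei4 ki1"))
```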
  • Step 107 The terminal generates a reply text in combination with the context information.
  • In a possible embodiment, if the user's speaking intention is to set TTS parameters, the terminal generates a reply text after setting the TTS parameters based on that intention, and the reply text is mainly used to inform the user that the terminal has completed the TTS parameter setting.
  • For example, if the user's intention indicated by the current input voice is "increase the sound speed" or "increase the volume", a preset text corresponding to the setting result may be returned as the reply text, such as "I will speak faster" or "The volume has been turned up a bit", and so on.
  • In a possible embodiment, the terminal may combine the content of the user's speech and the context information of the conversation to generate a reply text for replying to the user. For example, if the content of the user's input voice is "What is the weather today?", the terminal may query local or network resources, or use a conversation model, to obtain a reply text for the user; if the content of the user's input voice is "What is the line before 'someone in the depths of the white clouds'?", the terminal may query local or network resources, or use a conversation model, to obtain the reply text "The line before 'someone in the depths of the white clouds' is 'Far on the Hanshan stone trail'", and so on.
  • the terminal may generate a reply text through the NLG module of the voice response system and the context information in the DM module.
  • the reply text generation can be implemented through retrieval-based, model-based generation, and the like.
  • For retrieval-based generation, a specific method can be as follows: prepare a corpus of question-answer pairs in advance, find the question in the corpus that best matches the current question when generating the reply, and then return the corresponding answer as the reply text.
  • For model-based generation, a specific method may be: train a neural network model in advance on a large number of question-answer pairs; in the process of generating the reply text, use the question as the input to the neural network model, compute the corresponding answer, and use that answer as the reply text.
  • Step 108 The terminal determines the TTS parameters required for the current reply text.
  • Specifically, the terminal can determine, through the PM module (or the basic TTS parameter management module) of its voice response system, the basic TTS parameters associated with the current user identity, such as the changes to the preset pitch, preset sound speed, and preset volume, as well as the user's preferred pronunciation of any target characters (strings) in the reply text; the terminal can further determine the corresponding enhanced TTS parameters based on the content of the reply text and the context information, such as speech emotion parameters, speech scene parameters, and background sound effects.
  • the content of the reply text suitable for superimposing the background sound effect may be a poem, a film or television line, or a text with emotional polarity. It should be noted that related content about background sound effects will be described in detail later, and will not be repeated here.
  • Step 109 The terminal selects an acoustic model from a preset acoustic model library according to the current input voice. This step may also be performed before step 108.
  • the terminal is preset with an acoustic model library
  • the acoustic model library may include multiple acoustic models, such as a general acoustic model and several personalized acoustic models, and so on.
  • These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information, and these preset information can be used as the basic input information of the acoustic model.
  • The preset information of the general acoustic model may include two or more of the model's preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model likewise includes two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, and may additionally include other personalized information such as catchphrases, ways of responding in specific scenarios, type of wisdom, personality type, mixed-in popular expressions or dialects, forms of address for specific persons, and other language-style characteristics.
  • the acoustic model can convert the reply text into a reply voice according to the preset information and the change information of the preset information.
  • the change information of the preset information referred to here means information such as a basic TTS parameter, an enhanced TTS parameter, a mapping relationship between a target character and a user's preferred pronunciation, and a background sound effect selected in speech synthesis.
  • The speech synthesized through the general acoustic model presents the sound effect of a normal, generic dialogue scenario, while the speech synthesized through a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario; the implementation of the "character imitation" dialogue scenario will be described in detail later.
  • In a possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining the acoustic model preferred by the user according to the identity of the user, and selecting that preferred acoustic model from the plurality of acoustic models.
  • In another possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining, according to the content of the current input voice, an acoustic model identification related to that content; the identification of an acoustic model is used to uniquely characterize the acoustic characteristics of that model.
  • For example, the identification of one acoustic model is "Lin Zhiling", indicating that the acoustic model is used to synthesize a "Lin Zhiling"-style voice; the identification of another acoustic model is "Little Shenyang", indicating that the acoustic model is used to synthesize a "Little Shenyang"-style voice, and so on. If the content of the input speech is related to "Lin Zhiling", the acoustic model with the "Lin Zhiling" identification can be selected.
  • In yet another possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining a weight value for each of the plurality of acoustic models according to the identity of the user, where the weight value of each acoustic model is preset by the user or is determined in advance by learning the user's preferences; the acoustic models are then weighted and superimposed based on these weight values to obtain a comprehensive acoustic model (which may be referred to as a fusion model), and the fusion model is selected.
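  • The weighted superposition into a fusion model can be illustrated with a deliberately simplified sketch: real acoustic models are neural networks, so here each model is reduced to a small parameter vector purely to show the weighting step (model names and weights are assumptions):
```python
from typing import Dict, List

def fuse_acoustic_models(model_params: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weight-and-sum the parameter vectors of several acoustic models into one fusion model."""
    total = sum(weights.values())
    dim = len(next(iter(model_params.values())))
    fused = [0.0] * dim
    for name, w in weights.items():
        for i, value in enumerate(model_params[name]):
            fused[i] += (w / total) * value
    return fused

model_params = {
    "general":         [1.0, 1.0, 1.0],
    "lin_zhiling":     [1.2, 0.9, 1.1],   # identifiers stand for different voice styles
    "little_shenyang": [0.8, 1.3, 1.0],
}
weights = {"general": 0.5, "lin_zhiling": 0.3, "little_shenyang": 0.2}  # e.g. learned preferences
print(fuse_acoustic_models(model_params, weights))
```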
  • Step 110 The terminal generates a corresponding calling instruction according to the reply text and the determined TTS parameters.
  • the terminal may generate a call instruction required by the TTS engine according to a reply text, a determined TTS parameter, and the like through an instruction generation module of the voice response system.
  • For example, if the user's intention is to "increase the volume", the corresponding reply text is "The volume has been turned up a bit"; for the TTS parameters determined by the terminal and the calling instruction generated based on the reply text and those parameters, reference may be made to the example chart shown in FIG. 18, which is not repeated here.
  • Step 111: The terminal performs a speech synthesis operation based on the calling instruction. Specifically, the terminal uses the acoustic model to perform speech synthesis on the reply text according to the preset information of the acoustic model, the basic speech synthesis information, and the enhanced speech synthesis information, to obtain the reply voice.
  • Specifically, the terminal may use the TTS engine of its voice response system to call the acoustic model determined in step 109 to perform the speech synthesis operation, synthesizing speech based on the preset information of the acoustic model and the relevant TTS parameters to obtain the reply voice.
  • the TTS engine may be a system constructed based on a statistical parameter synthesis method, which can fully consider various TTS parameters to synthesize different styles of speech.
  • Step 112. The terminal returns a reply voice to the user.
  • the terminal may play the reply voice to a user through a speaker.
  • the terminal may further display a reply text corresponding to the reply voice through a display screen.
  • It can be seen that the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, and improving the timeliness of the human-machine dialogue.
  • the terminal also allows the user to tune the terminal's voice response system in real time by voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • the process includes but is not limited to the following steps:
  • Step S201 This step is a specific refinement of step S104 in the embodiment of FIG. 10 described above.
  • the terminal recognizes that the user's speaking intention is to correct the pronunciation of the target character, such as correcting the polyphony of one or more polyphonic characters.
  • the user ’s speech content is “wrong, should be read as xiao3 qian4, not xiao3 xi1”.
  • When the terminal analyzes the recognized text through the NLU module, it extracts the keywords "wrong" and "should be read as". The intent recognition module uses these keywords to match the preset sentence template "... read / called / spoken wrong ... should be read / called / spoken as ... not ...", and determines that the current user's speaking intention is to "correct the pronunciation of the target character" (that is, a TTS parameter needs to be updated).
  • Step S202 corresponds to step S105 in the embodiment of FIG. 10 described above, that is, the terminal determines whether the user's speaking intention is to update a TTS parameter.
  • Steps S203 to S205 These steps correspond to step S106 in the embodiment of FIG. 10, that is, the terminal performs an update operation of the TTS parameter indicated by the speaking intention. Steps S203-S205 are described in detail as follows:
  • Step S203 The terminal extracts misreading and target pronunciation.
  • Specifically, the terminal's intent recognition module may mark "xiao3 xi1" as the misread pronunciation and "xiao3 qian4" as the target pronunciation based on the matched preset sentence template.
  • Step S204 The terminal determines a target word (that is, a target character to be corrected) according to the misreading pronunciation and context information.
  • the terminal's DM module can find the dialogue text output by the terminal in the last round or previous rounds of conversations in the context information, and determine the pronunciation of each word in the dialogue text (such as using an acoustic model to determine the pronunciation ). For example, the output text of the terminal in the last round of conversation was "I'm glad to meet you, Xiao Qian", and the terminal determined that its corresponding pronunciation is "hen3, gao1, xing4, ren4, shi2, ni3, xiao3xi1".
  • The DM module matches the misread pronunciation against the pronunciation string of the output text, and can determine that the word corresponding to the misread pronunciation "xiao3 xi1" is "Xiao Qian"; that is, "Xiao Qian" is the target word (the character to be corrected).
  • Step S205 The terminal adds the target word and the target pronunciation to a customized character pronunciation list associated with the identity of the user.
  • Specifically, the terminal adds the target word "Xiao Qian" and the target pronunciation "xiao3 qian4" as a new target-character-to-pronunciation pair to the custom character pronunciation table associated with the current user identity through the PM module. Understandably, in future man-machine conversations, when the terminal's reply text contains "Xiao Qian", the PM module will determine, according to the records of the custom character pronunciation table, that the pronunciation of "Xiao Qian" is "xiao3 qian4".
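  • Steps S203-S205 can be sketched as the following simplified matching routine (the word/pinyin alignment of the previous reply is idealized; all names are illustrative):
```python
def correct_pronunciation(prev_words, prev_pinyin, misread, target, pron_table):
    """Find the word in the terminal's previous reply whose pronunciation matches the
    misread pinyin, and record the user-preferred (target) pronunciation for it."""
    normalize = lambda s: s.replace(" ", "")
    for word, pinyin in zip(prev_words, prev_pinyin):
        if normalize(pinyin) == normalize(misread):
            pron_table[word] = target
            return word
    return None

table = {}
target_word = correct_pronunciation(
    prev_words=["很", "高兴", "认识", "你", "小倩"],                          # "I'm glad to meet you, Xiao Qian"
    prev_pinyin=["hen3", "gao1 xing4", "ren4 shi2", "ni3", "xiao3 xi1"],   # terminal's (misread) pinyin
    misread="xiao3 xi1",
    target="xiao3 qian4",
    pron_table=table,
)
print(target_word, table)   # 小倩 {'小倩': 'xiao3 qian4'}
```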
  • It can be seen that, in a voice conversation, the terminal allows the user to tune its voice response system through voice in real time, correcting the pronunciation of the target character specified by the user (such as a polyphonic character) based on the user's intention, thereby updating the TTS parameters associated with the user's identity and preferences; this makes the tuned terminal closer to the user's interaction preferences and maximizes the user's interactive experience.
  • step S108 in the foregoing embodiment of FIG. 10 is described in detail below. Referring to FIG. 21, the process may include the following steps:
  • Step 301 This step is a refinement of step S103 in the embodiment of FIG. 10 described above.
  • the terminal determines whether the user identity of the current user is registered (or whether the identity verification is passed).
  • Step 302. If the terminal determines that the user identity of the current user is registered, read the basic TTS parameters associated with the user.
  • For example, if the current user is "xiaoming_grandma", the basic TTS parameters associated with "xiaoming_grandma" can be found in the TTS parameter database: the preset sound-speed change coefficient is -40%, the preset volume change coefficient is +40%, and the preset pitch change coefficient is +20%.
  • Step 303 If the terminal determines that the user identity of the current user has not been registered (or has not passed identity authentication), it obtains default basic TTS parameters.
  • For example, if the current user is xiaohua, since the identity "xiaohua" has not been registered and does not exist in the TTS parameter database, the corresponding default values for unregistered users (as shown in FIG. 11, the preset sound speed, preset volume, and preset pitch change coefficients are all 0) can be returned as the basic TTS parameters of the current user.
  • Step 304: The terminal compares the reply text with the custom character pronunciation table associated with the current user, determines whether there are any characters/words/symbols in the text that match entries of the custom character pronunciation table, and if so, obtains the target pronunciation of those characters/words/symbols.
  • Step 305 The terminal obtains the speech emotion parameters in the corresponding enhanced TTS parameters from the TTS parameter database according to the reply text.
  • Specifically, the DM module may be preset with an emotion recommendation model trained on a large number of dialogue texts with emotion labels. The DM module inputs the reply text into the emotion recommendation model and can determine the emotion category (such as happiness or sadness) and the degree of emotion (such as mild happiness or moderate happiness) of the current reply text. The PM module then determines the speech emotion parameters from the emotion parameter correction mapping table of the TTS parameter database according to the DM module's emotion recommendation. For example, if the current reply text is "That's great" and the emotion recommended by the emotion recommendation model for the reply text is "moderately happy", the PM module obtains the speech emotion parameters corresponding to "moderately happy" in the emotion parameter correction mapping table shown in FIG. 13.
  • Step 306 The terminal obtains the voice scene parameters in the corresponding enhanced TTS parameters from the TTS parameter database according to the reply text and the context information.
  • the DM module may determine the scene of the current conversation according to the context information of the current conversation and the reply text. Furthermore, the PM module can obtain the voice scene parameters in the corresponding enhanced voice parameters according to the determined dialogue scene.
  • For example, if the current reply text is a specific line of a seven-character poem (for example, "Mengbo Dongwu Wanli Ship"), and the DM module determines from the dialogue context information and the reply text that the current dialogue scene is an ancient poem Solitaire scene, the voice scene is positioned as "poetry recitation", and the PM module obtains the speech scene parameters corresponding to "poetry recitation" in the scene parameter correction mapping table shown in FIG. 15.
  • Similarly, if the voice scene is positioned as "song humming", the PM module obtains the speech scene parameters corresponding to "song humming" in the scene parameter correction mapping table shown in FIG. 15.
  • If the DM module determines from the preceding dialogue context information and the reply text that the current scene is a character imitation scene, the voice scene is positioned as "character imitation", and the PM module obtains the speech scene parameters corresponding to "character imitation" in the scene parameter correction mapping table shown in FIG. 15, and so on.
  • It can be seen that the terminal can select different TTS parameters for different users (such as basic TTS parameters, the user's preferred pronunciation of target characters, speech emotion parameters, and speech scene parameters) based on the interactive reply text and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, improving the timeliness of the human-machine dialogue, and improving the user's interactive experience.
  • the speech synthesis method of the embodiment of the present invention is described below by taking a speech scene of "poem recitation" as an example. Referring to FIG. 22, the method can be described by the following steps:
  • Step 401 The terminal presets a voice scene parameter of "poem recitation".
  • the TTS parameter database of the terminal is preset with a voice scene parameter of "poem recitation".
  • The "poetry recitation" speech scene focuses on the prosodic rhythm of the speech; the "poetry recitation" speech scene parameters are used to adjust, for input text that conforms to a specific syntactic format, the pause positions/pause durations (that is, the segmentation of the text content), the reading length of characters or words, and the stress positions, so as to strengthen the rhythm. Compared with the natural rhythm of ordinary dialogue, the strengthened prosodic rhythm is clearer and has stronger emotional expression; for example, when reading poems, nursery rhymes, and other texts with a specific syntactic format, the strengthened prosodic rhythm can produce a cadenced, rising-and-falling feeling.
  • The voice scene parameters of "poem recitation" can be implemented through prosodic rhythm templates.
  • The text content of each specific literary style (or syntactic format) can correspond to one or more prosodic rhythm templates.
  • Each prosodic rhythm template defines the volume change of the word at each position in the template (that is, the stress of the word), the change of its sound length (that is, the duration of the word's pronunciation), and
  • the pause positions/pause durations of the pronunciation in the text (that is, the word segmentation of the text content).
  • For example, for a five-character line, the word segmentation can be done in two ways: "2 words-3 words" and "2 words-2 words-1 word";
  • correspondingly, the reading durations of the words can be "short-long-short-long" or "short-short-long-long",
  • and the stress of the words can be "light-light-light-heavy" or "light-heavy-light-light-heavy", respectively.
  • Another approach is training and learning based on a dedicated prosodic corpus read aloud by voice models, using statistical, machine learning, or deep network frameworks to obtain a model covering pause positions, character or word reading durations, and stress positions.
  • After the model is trained, the text content to which the "poem recitation" mode needs to be applied is input into the model, and the prosodic rhythm template corresponding to that text content is obtained.
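  • For illustration only, such a prosodic rhythm template could be stored as a small lookup structure. The field names (segmentation, duration, stress, pause_after) and the concrete values in the sketch below are hypothetical, not values prescribed by this embodiment:

```python
# Illustrative sketch: a hand-written prosodic rhythm template for a
# five-character line. Field names and values are hypothetical; durations
# and stresses follow the "short"/"long" and "light"/"heavy" notation above.
FIVE_CHAR_LINE_TEMPLATE = {
    "segmentation": [2, 3],                                   # "2 words-3 words"
    "duration":     ["short", "long", "short", "short", "long"],
    "stress":       ["light", "light", "light", "light", "heavy"],
    "pause_after":  [0.02, 0.0, 0.0, 0.0, 0.05],              # pause (s) after each word
}

TEMPLATES = {"five_char_line": FIVE_CHAR_LINE_TEMPLATE}

def lookup_template(literary_style: str) -> dict:
    """Return the prosodic rhythm template associated with a literary style."""
    return TEMPLATES[literary_style]

print(lookup_template("five_char_line")["segmentation"])      # -> [2, 3]
```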
  • Step 402 The terminal determines, based on the reply text and context information, that the voice scene of the current conversation is the "poem recitation" voice scene.
  • the terminal may determine that the voice scene of the current conversation is a voice scene of "poem recitation" through the DM module.
  • the manner in which the DM module determines the current scene as a voice scene of "poem recitation” may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is a "poem recitation".
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "poem recitation" voice scene.
  • For example, if the user inputs a voice instructing the terminal to perform Tang poetry recitation or ancient-poem Solitaire, the terminal automatically sets the current dialogue scene to the "poem recitation" voice scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the reply text involves poems, ci, qu, fu, and similar literary forms,
  • that is, one or more specific literary styles such as five-character quatrains, seven-character quatrains, or regulated verse, or specific ci or qu tune patterns.
  • the DM module can search local pre-stored libraries or search libraries in the web server through text search matching or semantic analysis.
  • The library can contain literary knowledge materials corresponding to a variety of literary styles.
  • The DM module then determines whether the content of the reply text exists in the library, and if so, sets the current dialogue scene to the "poem recitation" voice scene.
  • Alternatively, the DM module can analyze punctuation (pauses), word counts, the number of sentences, the sequence of word counts per sentence, and so on, and match a piece of text or all of the text in the reply text against pre-stored literary style features. If the match succeeds, the matching piece of text (or all of the text) can be used as the text for the "poem recitation" voice scene.
  • the literary style characteristics of five-character quatrains include: 4 sentences, each sentence is 5 words, a total of 20 words.
  • the literary style features of five-character regulated verse include: 8 sentences, each sentence is 5 words, a total of 40 words.
  • the literary style characteristics of the seven-character quatrains include: 4 sentences, each sentence is 7 words, a total of 28 words.
  • the literary style features of the Song ci short tune "Rumengling" include: 7 sentences, with 6, 6, 5, 6, 2, 2, and 6 characters respectively. Suppose a piece of text in the reply text reads: "The mountains are like daisies outside the window, the classroom is boring. The teacher on the stage speaks at a high speed."
  • In that case, the DM module can determine that its literary style features conform to those of "Rumengling", and set the current dialogue scene to the "poem recitation" voice scene.
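  • As an informal sketch of this matching step, the reply text can be cut at punctuation marks and the resulting per-sentence character counts compared against pre-stored style features. The style names, the exact patterns, and the helper function below are assumptions for illustration, not the embodiment's actual matching logic:

```python
import re

# Illustrative literary-style features: characters per sentence for each style.
STYLE_FEATURES = {
    "five_char_quatrain":  [5, 5, 5, 5],           # 4 sentences x 5 characters
    "seven_char_quatrain": [7, 7, 7, 7],           # 4 sentences x 7 characters
    "rumengling":          [6, 6, 5, 6, 2, 2, 6],  # Song ci short tune "Rumengling"
}

def detect_literary_style(text: str):
    """Split the text at punctuation and compare per-sentence lengths to known styles."""
    sentences = [s for s in re.split(r"[，。！？、,.!?\s]+", text) if s]
    counts = [len(s) for s in sentences]
    for style, pattern in STYLE_FEATURES.items():
        if counts == pattern:
            return style        # matched: switch to the "poem recitation" scene
    return None                 # no match: keep the normal dialogue scene
```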
  • Step 403 The terminal determines a voice scene parameter corresponding to the current "poem recitation" voice scene.
  • the terminal determines a voice scene parameter corresponding to the current "poem recitation" voice scene through the PM module.
  • The literary style (or literary style feature) is associated with a prosodic rhythm template,
  • so the PM module can obtain the associated prosodic rhythm template from the TTS parameter database. The prosodic rhythm template contains the corresponding voice scene parameters (that is, prosodic rhythm change information); specifically, the voice scene parameters include information such as the volume changes and duration changes of the words at each position in the template, and the pause positions/pause durations of the speech in the text.
  • In other words, the voice scene parameters corresponding to the prosodic rhythm template include the specific word segmentation method, the reading duration of each character in each sentence, and the stress information for each character.
  • In addition, the selection of the voice scene parameters may also be closely related to the voice emotion parameters; that is, different emotion categories (such as happiness or sadness) and different emotion levels (such as mild happiness or moderate happiness) may both affect the voice scene parameters, namely the specific parameters of the prosodic rhythm template corresponding to the literary style (or literary style features).
  • The advantage of this design is that the voice scene can be brought closer to the current voice emotion, which helps make the final voice output more vivid and natural.
  • For example, the standard parameters of the template include the "2 words-3 words" word segmentation method, together with the corresponding reading duration pattern ("short"/"long") and stress pattern ("light"/"heavy") for each word.
  • Under different voice emotions, however, the final speech presentation of the prosodic rhythm template will differ, and this difference may lie in changes such as word segmentation, intonation, and stress.
  • Table 1 shows a prosodic rhythm template for five-character quatrains and how different voice emotions affect it.
  • The voice emotion 1, voice emotion 2, and voice emotion 3 listed in Table 1 may indicate emotion categories (such as happiness, neutral emotion, sadness) or emotion levels (such as mild happiness, moderate happiness, and extreme happiness). Therefore, for the determined prosodic rhythm template, the PM module can determine the final voice scene parameters from rules similar to those shown in Table 1 according to the voice emotion parameters of the reply text.
  • In a possible embodiment, a support vector machine (SVM) or a deep neural network may also be used to train a model on a large number of prosodic rhythm templates corresponding to different voice emotions, so as to obtain a trained model.
  • The terminal can then input the standard prosodic rhythm template corresponding to the reply text, together with the voice emotion parameters corresponding to the reply text, into the trained model to obtain the final voice scene parameters.
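  • A minimal sketch of that idea, assuming scikit-learn's SVC as the classifier and an invented numeric encoding of emotion categories, emotion levels, and template variants; it only illustrates the shape of the mapping, not an actually trained model:

```python
from sklearn.svm import SVC   # assumed dependency; any classifier would do

# Hypothetical training data: [emotion_category, emotion_level, base_template_id]
X_train = [
    [0, 1, 0],   # mild happiness,   five-character template -> variant 0
    [0, 3, 0],   # extreme happiness                          -> variant 1
    [2, 2, 0],   # moderate sadness                           -> variant 2
]
y_train = [0, 1, 2]   # ids of the adjusted template variants

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

def pick_template_variant(emotion_category: int, emotion_level: int, base_id: int) -> int:
    """Predict which prosodic-rhythm-template variant to use for this emotion."""
    return int(clf.predict([[emotion_category, emotion_level, base_id]])[0])
```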
  • Step 404 The terminal aligns the content of the reply text with the prosodic rhythm template to facilitate subsequent speech synthesis.
  • Specifically, the terminal may align the relevant content in the reply text with the prosodic rhythm template of the "poem recitation" voice scene: the terminal combines the pronunciation segments produced by the corresponding acoustic model for the relevant content in the reply text with the parameters of the prosodic rhythm template, and superimposes the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
  • For example, assume the prosody enhancement parameter is δ (0 ≤ δ ≤ 1),
  • and the preset volume of the i-th word in the text content is Vi. If the prosodic rhythm features of the word include a stress feature whose stress change amount is E1, then the final volume of the word is Vi × (1 + E1) × (1 + δ).
  • If the basic sound length of the i-th word in the text is Di and its sound length change amount is E2, then the final sound length of the character is Di × (1 + E2).
  • If a pause is required between the i-th word and the (i + 1)-th word, the pause time is changed from 0s to 0.02s, for example.
  • Assume the reply text includes text content such as "Bai Ri Yi Shan Jin".
  • "Bai Ri Yi Shan Jin" is the first line of a five-character quatrain. If the reply text were synthesized using only the general acoustic model, the synthesized speech (which can be called the basic pronunciation segment) would be "bai2, ri4, yi1, shan1, jin4", with a default basic pronunciation duration of 0.1s for each character and a default interval of 0 between characters.
  • In this embodiment, however, the terminal selects the prosodic rhythm template corresponding to five-character quatrains when choosing the TTS parameters, so that in the subsequent process of synthesizing the reply text through the general acoustic model, the prosodic rhythm template corresponding to five-character quatrains
  • is additionally superimposed on this basic pronunciation segment. As a result, in the final synthesized speech, as shown in FIG. 23, the pronunciation durations of different words in the segment are lengthened to different degrees.
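  • A rough sketch of this superposition for "Bai Ri Yi Shan Jin", using the formulas above; the stress gains E1, duration gains E2, pause values, and δ = 0.5 below are made-up illustrations, not values taken from the embodiment:

```python
# Each character starts from the basic pronunciation segment (volume 1.0,
# duration 0.1 s, interval 0) and the template parameters are superimposed
# as V_i * (1 + E1) * (1 + delta) and D_i * (1 + E2).
delta = 0.5                        # prosody enhancement parameter, 0 <= delta <= 1
base = [("bai2", 1.0, 0.1), ("ri4", 1.0, 0.1), ("yi1", 1.0, 0.1),
        ("shan1", 1.0, 0.1), ("jin4", 1.0, 0.1)]      # (pinyin, volume V, duration D)
template = [                        # (stress gain E1, duration gain E2, pause after)
    (0.0, 0.0, 0.00), (0.2, 0.5, 0.02), (0.0, 0.0, 0.00),
    (0.0, 0.5, 0.00), (0.2, 1.0, 0.02),
]

enhanced = []
for (syllable, v, d), (e1, e2, pause) in zip(base, template):
    enhanced.append({
        "syllable": syllable,
        "volume":   v * (1 + e1) * (1 + delta),   # stressed characters become louder
        "duration": d * (1 + e2),                 # some characters are lengthened
        "pause":    pause,                        # pause inserted after the character
    })
print(enhanced)
```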
  • The following describes the speech synthesis method of the embodiment of the present invention by taking the voice scene of "song humming" (using a nursery rhyme as an example).
  • the method can be described by the following steps:
  • Step 501 The terminal presets the voice scene parameters of "children's song humming".
  • the TTS parameter database of the terminal is preset with voice scene parameters of “Children's Song Humming”.
  • In music, time is divided into equal basic units, and each basic unit is called a "beat".
  • the time value of the beat is expressed by the time value of the note.
  • The time value of a beat can be a quarter note (that is, a quarter note counts as one beat), a half note (a half note counts as one beat), or an eighth note (an eighth note counts as one beat).
  • The meter of music is generally defined in terms of beats, for example 4/4 time: in 4/4 time, a quarter note counts as one beat and there are 4 beats per measure, so each measure can contain 4 quarter notes.
  • Presetting the "children's song humming" voice scene parameters therefore means presetting a variety of nursery rhyme beat types, as well as the text segmentation of the reply text content that needs to be synthesized in the "children's song humming" manner.
  • The beat of the children's song may be determined according to the number of words between two punctuation marks, or according to the number of words in each field after word segmentation.
  • One way is to cut the reply text at punctuation marks: the punctuation marks in the reply text are identified, and suppose the number of words in the fields delimited by the punctuation marks is "3, 3, 7, 8, 3, 8". It can be seen that fields with 3 words appear most often, so the beat that most closely matches the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time, and so on.
  • The other way is to apply word segmentation to the reply text. The segmentation result is, for example, "little / swallow / wears / flowered clothes / every year / spring / comes / here / to / ask / swallow / you / why / come / swallow / says / here / the / spring / most / beautiful". To maintain semantic coherence, the segmentation result can be adjusted so that verbs, adjectives, and adverbs that modify a noun are attached to the modified noun and merged into one field.
  • The previous segmentation result is thus further adjusted to "little swallow / wearing flowered clothes / every year / spring / comes here / to / ask the swallow / why do you / come / the swallow says / here / the spring / most beautiful", and
  • the number of words in each resulting field is "3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2, 3". As can be seen, fields with 3 words appear most often, so the beat that most closely matches the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time, and so on.
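  • A minimal sketch of this beat-selection heuristic, operating directly on the field lengths listed above (the helper name is an assumption for illustration):

```python
from collections import Counter

# Character counts of the fields obtained by the adjusted word segmentation above.
field_lengths = [3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2, 3]

def pick_beat_multiple(lengths):
    """Use the most frequent field length as the beat multiple."""
    most_common_len, _ = Counter(lengths).most_common(1)[0]
    return most_common_len

print(pick_beat_multiple(field_lengths))   # -> 3, so 3/3 or 3/4 time fits best
```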
  • Step 502 The terminal determines, based on the reply text and context information, that the voice scene of the current conversation is the "children's song humming" voice scene.
  • Specifically, the terminal may determine, through the DM module, that the voice scene of the current conversation is the "children's song humming" voice scene.
  • the manner in which the DM module determines the current scene as a voice scene of "Children's Song Humming" may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current conversation is a “child song humming”.
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "children's song humming" scene. For example, if the user inputs a voice instructing the terminal to sing children's songs, the terminal automatically sets the current dialogue scene to the "children's song humming" voice scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the reply text involves the content of children's songs.
  • the DM module can search the local pre-stored nursery rhyme library or search the nursery rhyme library in the web server through text search matching or semantic analysis.
  • The lyric library can contain the lyrics of various nursery rhymes, and the DM module judges whether the content of the reply text exists in these lyrics; if it does, the current dialogue scene is set to the "children's song humming" voice scene.
  • Step 503 The terminal determines a voice scene parameter corresponding to the current "Children's Song Mode".
  • the terminal determines a voice scene parameter corresponding to the current "children's song mode" through a PM module.
  • Specifically, the PM module may determine a text segmentation method according to the content of the reply text (refer to the two methods described above), use this method to segment the reply text to obtain a segmentation result, and then determine the best matching beat according to the segmentation result.
  • Step 504 The terminal aligns the content of the reply text with the determined beat to facilitate subsequent speech synthesis.
  • Specifically, the terminal may align the content of the reply text with the determined beat through the PM module, so as to ensure that each text field follows the beat pattern of the nursery rhyme; that is, the terminal aligns the segmented text fields with the time axis according to the beat pattern.
  • For example, if a field contains 3 words and the measure contains 3 beats, the 3 words can be aligned with the 3 beats of the measure.
  • It may also happen that the number of words in a field of the reply text is less than the number of beats in the measure. For example, if the field has 2 words and the meter is 4/4, the terminal searches the adjacent text fields before and after that field; if the preceding field (or the following field) also has 2 words, this field can be merged with it so that together they align with the 4 beats of the measure. If the adjacent fields cannot be merged, or the number of words after merging is still less than the number of beats, the beats can be further aligned in the following ways.
  • One way is to fill the part of the text that is shorter than the number of beats with silence. Specifically, if the number of words matched to one measure of music is less than the number of beats, each word is matched in time to the position of one beat, and the remaining beats are filled with silence. As shown in (a) of FIG. 25, for the field "Little White Rabbit" in the reply text, if the matching meter is 4/4, then "little", "white", and "rabbit" can be aligned with the first, second, and third beats of the measure, and silence fills the fourth beat. It should be noted that the figure only shows one implementation; in practice, the silence may fall on any one of the first to fourth beats.
  • Another way is to align the rhythm by lengthening the sound length of a word.
  • the purpose of aligning the words and the beats can be achieved by lengthening the reading time of one or more words.
  • For example, if the matching meter is 4/4, "little" and "white" can be aligned with the first and second beats of the measure respectively,
  • and the pronunciation of "rabbit" is stretched so that "rabbit" spans the third and fourth beats. It should be noted that the figure only shows one implementation; in practice, the word whose pronunciation is lengthened may be any word in "Little White Rabbit".
  • Another way is to lengthen the sound length of each word evenly to ensure the overall time alignment.
  • That is, the pronunciation time of each character in the text field is extended evenly so that the characters' pronunciation time aligns with the beats of the music.
  • For example, for the 3-word field "Little White Rabbit" with a matching meter of 4/4, the reading time of each word can be lengthened to 4/3 of a beat, which ensures that the entire field is aligned with the measure.
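  • The three alignment strategies can be sketched as follows; the function names and the beat bookkeeping are illustrative assumptions, not the embodiment's actual implementation:

```python
# Durations are measured in beats; "rest" marks silence.
def align_with_rest(chars, beats_per_measure):
    """Strategy 1: one beat per character, pad the remaining beats with silence."""
    timed = [(c, 1.0) for c in chars]
    timed += [("rest", 1.0)] * (beats_per_measure - len(chars))
    return timed

def align_stretch_last(chars, beats_per_measure):
    """Strategy 2: lengthen one character (here the last) to fill the measure."""
    timed = [(c, 1.0) for c in chars[:-1]]
    timed.append((chars[-1], beats_per_measure - (len(chars) - 1)))
    return timed

def align_stretch_evenly(chars, beats_per_measure):
    """Strategy 3: stretch every character evenly (3 chars over 4 beats -> 4/3 each)."""
    per_char = beats_per_measure / len(chars)
    return [(c, per_char) for c in chars]

field = ["little", "white", "rabbit"]   # "Little White Rabbit", 3 characters
print(align_with_rest(field, 4))        # one beat each plus one beat of silence
print(align_stretch_last(field, 4))     # "rabbit" spans beats 3 and 4
print(align_stretch_evenly(field, 4))   # each character lasts 4/3 beats
```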
  • The following uses the acoustic models for implementing "character imitation" as an example to describe the speech synthesis method of the embodiment of the present invention. Referring to FIG. 26, the method can be described by the following steps:
  • Step 601 The acoustic model library of the terminal presets an acoustic model for implementing "character imitation".
  • the acoustic model library of the terminal is preset with various acoustic models (i.e., personalized acoustic models) for implementing "character imitation".
  • An acoustic model for "character imitation" can be used to make the synthesized speech carry the sound characteristics of a specific character; therefore, the preset timbre, preset intonation, and preset prosodic rhythm of a "character imitation" acoustic model will differ from the corresponding information of the general acoustic model.
  • The character imitated by a "character imitation" acoustic model may be a character image the user likes, a character in a film or television work, or a combination of multiple preset sound modes and user preferences.
  • A "character imitation" acoustic model can be an acoustic model that imitates the user's own speaking style, or an acoustic model that imitates the speaking characteristics of other characters:
  • for example, an acoustic model used to imitate "Lin Zhiling / soft voice", an acoustic model used to imitate "Little Shenyang / funny voice", an acoustic model used to imitate "Andy Lau / deep voice", and so on.
  • In a possible embodiment, the terminal does not select a single specific acoustic model from the acoustic model library, but rather a comprehensive model formed from multiple acoustic models in the acoustic model library.
  • That is, in addition to acoustic models that preset the sound characteristics of specific characters, the acoustic model library allows different voice characteristics and different language style characteristics to be combined according to user preferences or needs, so as to form an acoustic model with individualized characteristics.
  • Here, the voice characteristics include speech rate (sound speed), intonation, prosodic rhythm, timbre, and so on.
  • Differences in timbre arise because, in addition to a fundamental tone, a sound naturally contains many different frequencies interwoven as overtones; these determine different timbres, so that people can distinguish different voices when listening.
  • the characters represented by these different sounds can be natural persons (such as users, sound models, etc.), or they can be animated characters or virtual characters (such as robot cats, Luo Tianyi, etc.).
  • Language style features include mantras (including habitual mood words), responses to specific scenarios, type of intelligence, personality type, popular expressions or dialects mixed into speech, and the forms of address used for specific people. That is to say, for an acoustic model that combines different voice characteristics and different language style characteristics according to the user's preferences or needs, the preset information includes, in addition to two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, the language style features as well.
  • A user's mantra consists of phrases the user habitually says, intentionally or not. For example, when surprised, some people add "Are you mistaken?" in front of a sentence, and some people often insert hedging words such as "maybe" or "perhaps" in the middle of sentences. In addition, a mantra may also include habitual mood words, such as the iconic mood word that the comedian Xiao Shenyang often appends to the end of his sentences.
  • A response to a specific scenario refers to the reply a person most commonly gives in a specific scenario or to a specific question. For example, to a question like "Where shall we eat?", a person's habitual response may be "Anywhere is fine"; to a question like "What beer do you want?", a person's habitual response may be "Tsingtao beer", and so on.
  • The type of intelligence is used to distinguish the tendencies of different groups of people toward different ways of presenting content.
  • The types of intelligence further include the following: the linguistic type, where people have strong reading ability, like reading text and playing word games, and are good at writing poems or stories; the logical-mathematical type, where people are rational, good at calculation, and sensitive to numbers; the musical type, where people are sensitive to melody and sound, like music, and learn more efficiently with music in the background; the spatial type, where people are sensitive to their surroundings, like reading charts, and are good at drawing; the bodily-kinesthetic type, where people are good at using their bodies and like sports and hands-on making; the interpersonal type, where people are good at understanding and communicating with others; the intrapersonal (introspective) type, where people like to think independently and set their own goals; and the naturalist type, where people are interested in the natural creatures of the planet.
  • Personality types refer to the different language styles of people with different personalities. For example, a person with a steady personality has a rigorous language style; a person with a lively personality has a humorous language style; a person with an introverted personality has a euphemistic language style; and so on.
  • Mixing dialects into speech means that a person likes to mix native dialect or foreign-language expressions into their speech; for example, when expressing thanks, someone may prefer a Cantonese expression or the English "Thank you".
  • Including popular expressions in speech means that a person likes to use currently popular words or Internet slang in place of particular words; for example, when sad, a person may say "blue thin mushroom" (Internet slang) instead of "uncomfortable".
  • The form of address for a specific person refers to using a particular appellation for that person; for example, the user may call a specific person surnamed Wang "Mr. Wang" or "Lao Wang", and so on.
  • the voice response system of the terminal can obtain the voice characteristics and language style characteristics associated with the user identity through learning.
  • For example, user preferences can be acquired and analyzed in advance through feature transfer; that is, user needs can be determined from the information the user consumes in other dimensions, so as to further infer the voice features and language style features the user is likely to prefer.
  • For instance, the characteristics of the songs the user likes can be analyzed and aggregated: the speech rate and rhythmic strength of the synthesized speech are determined according to the rhythmic strength of the songs; the timbre characteristics of the synthesized speech are determined according to the voice characteristics of the corresponding singers; and the language style features of the synthesized speech are determined according to the style of the lyrics.
  • Similarly, the features of the user's favorite TV programs, social media content, and other dimensions can be analyzed and aggregated to train a feature transfer model, which can then be used to infer the voice features and language style features the user prefers.
  • The terminal's voice response system can also obtain and analyze user preferences through multi-modal information; that is, by analyzing the user's expressions, attention level, and operating behavior, it automatically infers the user's preferences or needs regarding the synthesized speech features.
  • Through multi-modal analysis, the user's requirements for synthesized speech can be collected before personalized synthesized speech is generated, and the user's preference for the speech can also be continuously tracked after the personalized speech is generated, so that the features of the synthesized speech can be iteratively optimized based on this information.
  • For example, the user's preference for different voices can be obtained indirectly by analyzing the user's degree of attention when hearing different synthesized voices (the degree of attention can be obtained from the user's facial expression information, or from EEG or other bioelectric signals collected by the user's wearable device), or by analyzing the user's operating habits (for example, skipping a voice or fast-forwarding through it may indicate that the user does not like that voice much).
  • the following describes the acoustic model with the sound characteristics of a specific character and a comprehensive model (or fusion model) obtained by fusing multiple acoustic models.
  • Models can be fused; for example, an acoustic model that imitates "Lin Zhiling / soft voice" can be fused with an acoustic model that imitates "Little Shenyang / funny voice". As another example, the user's own voice characteristics and language style characteristics, or the voice characteristics and
  • language style characteristics of a character image the user likes, can be combined with the acoustic models corresponding to character images in film and television works (such as the "Lin Zhiling / soft voice" acoustic model and the "Little Shenyang / funny voice" acoustic model) to obtain
  • the final acoustic model used for subsequent speech synthesis.
  • For example, the multiple personalized acoustic models in the acoustic model library can respectively provide deep, soft, cute, funny, and other types of voices.
  • After acquiring the user's preferences or needs regarding voice (these preferences or needs are directly associated with the user's identity), the terminal determines the user's preference coefficient for each of these acoustic models, and these preference coefficients serve as the weights of the corresponding acoustic models.
  • the weight value of each acoustic model is manually set in advance by a user according to his own requirements, or the weight value of each acoustic model is automatically determined by the terminal in advance by learning user preferences. Then, the terminal may perform weighted superposition on the respective acoustic models based on the weight value, so as to obtain a comprehensive acoustic model by fusion.
  • Specifically, according to the voice characteristics and language style characteristics the user likes, the terminal may select the features of the one or several dimensions the user prefers most and match them against
  • the voices of the multiple acoustic models, so as to determine the user's preference coefficient for the voice of each acoustic model.
  • The sound characteristics of each acoustic model are then combined according to the corresponding preference coefficients to obtain the final voice scene parameters.
  • The table shown in FIG. 27 exemplarily gives the sound characteristics corresponding to various voice types (deep, soft, funny); it can be seen that different voice types have correspondingly different speech rates, intonations, prosodic rhythms, and timbres.
  • When obtaining the user's preferences or needs regarding voice, the terminal can also directly match the voices of the multiple acoustic models according to the user's identity (that is, the user's preferences or needs are directly tied to the user's identity), thereby determining, for example,
  • that the user's preference coefficients for several voice types such as deep, soft, and funny are 0.2, 0.8, and 0.5 respectively; that is, the weights of these acoustic models are 0.2, 0.8, and 0.5 respectively.
  • The final acoustic model (that is, the fusion model) can then be obtained by weighted superposition of the speech rate, intonation, prosodic rhythm, timbre, and other characteristics of these acoustic models.
  • The resulting voice scene parameters transform the sound of the acoustic model in terms of speech rate, intonation, prosodic rhythm, and timbre, which helps produce mixed sound effects such as a "talking Lin Zhiling" or a "rapper-mode Lin Zhiling".
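  • A minimal sketch of such weighted superposition, assuming each acoustic model exposes numeric sound characteristics; the feature values are invented, and normalizing by the total weight is an assumption rather than something the embodiment specifies:

```python
# Fuse the sound characteristics of several personalized acoustic models using
# the user's preference coefficients as weights (illustrative values only).
models = {
    "deep":  {"speech_rate": 0.9, "intonation": 0.8, "rhythm": 0.9, "timbre": 0.3},
    "soft":  {"speech_rate": 1.0, "intonation": 1.2, "rhythm": 1.0, "timbre": 0.8},
    "funny": {"speech_rate": 1.2, "intonation": 1.4, "rhythm": 1.3, "timbre": 0.6},
}
weights = {"deep": 0.2, "soft": 0.8, "funny": 0.5}   # preference coefficients

def fuse(models, weights):
    total = sum(weights.values())
    features = next(iter(models.values())).keys()
    return {
        f: sum(weights[name] * models[name][f] for name in models) / total
        for f in features                            # normalized weighted average
    }

print(fuse(models, weights))   # parameters of the fused ("comprehensive") model
```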
  • the embodiment of the present invention is not limited to using the above-mentioned method to obtain a comprehensive model of multiple acoustic models (abbreviated as a fusion model).
  • In a possible embodiment, the user can also manually select parameters, or make a text or voice request, to form the final acoustic model.
  • For example, the terminal may provide a graphical user interface or a voice interaction interface, and the user may select the parameters of each voice feature and each language style feature according to his or her preference. FIG. 28 shows such a selection interface for voice feature parameters and language style feature parameters.
  • Suppose the user selects the voice features corresponding to the "Lin Zhiling" acoustic model; then the parameter values of sub-parameters such as speech rate, intonation, prosodic rhythm, and timbre of the "Lin Zhiling" acoustic model
  • are used as the parameter values of the corresponding sub-parameters of the fusion model's voice features.
  • Similarly, suppose the user selects the language style features corresponding to the "Little Shenyang" acoustic model;
  • then the parameter values of the fusion model's language style sub-parameters are taken from the language style features of the "Little Shenyang" acoustic model.
  • In another example, the user may send a text or voice request to the terminal in advance, such as "Please speak with Lin Zhiling's voice in Xiao Shenyang's language style". The terminal's voice response system then parses the user's request and sets the speech rate, intonation, prosodic rhythm, and timbre of the fusion model's voice features to the corresponding sub-parameter values of the "Lin Zhiling" acoustic model, and sets the mantra, responses to specific scenarios, type of intelligence, personality type, and dialect/popular expression sub-parameters of the fusion model's language style features to the corresponding values of the "Little Shenyang" acoustic model.
  • In a possible embodiment, the terminal may also determine the acoustic model preferred by the user according to the user's identity, so that during speech synthesis the terminal can directly select that preferred acoustic model from the acoustic model library.
  • the acoustic model preferred by the user is not necessarily the personalized acoustic model originally set in the acoustic model library, but may be an acoustic model obtained by fine-tuning parameters of a personalized acoustic model according to the preference of the user.
  • For example, the sound characteristics of a personalized acoustic model originally set in the acoustic model library include a first speech rate (sound speed), a first intonation, a first prosodic rhythm, and a first timbre.
  • Through analysis of user preferences or through the user's manual settings, the terminal determines the user's preferred parameter combination, for example: 0.8 times the first speech rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm, and a 1.2-fold feminization of the first timbre; these parameters are then adjusted accordingly to obtain a personalized acoustic model that meets the user's needs.
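  • Sketched as code, this fine-tuning is simply an element-wise scaling of the original model's characteristics by the user's preferred multipliers (the parameter names and base values are illustrative):

```python
# Scale the original personalized acoustic model's characteristics by the
# multipliers determined from user preferences or manual settings.
base_model = {"speech_rate": 1.0, "intonation": 1.0, "rhythm": 1.0, "timbre": 1.0}
multipliers = {"speech_rate": 0.8, "intonation": 1.3, "rhythm": 0.9, "timbre": 1.2}

tuned_model = {k: base_model[k] * multipliers[k] for k in base_model}
print(tuned_model)   # acoustic model parameters adjusted to the user's preference
```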
  • Step 602 The terminal determines, through the input voice of the user, that the current dialogue needs to adopt an acoustic model of "character imitation.”
  • Specifically, the terminal may determine, through the DM module, that the scene of the current dialogue needs to be set to "character imitation".
  • the manner in which the DM module determines the current scene as a voice scene of "character imitation” may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is "character imitation”.
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "character imitation" scene. For example, if the user inputs a voice instructing the terminal to speak with Lin Zhiling's voice, the terminal automatically sets the current dialogue scene to a "character imitation" scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the input text corresponding to the user's input voice involves content related to character imitation.
  • Such content can be detected through methods such as full-text matching, keyword matching, and semantic similarity matching; the content includes lyrics, sound effects, movie lines, animation dialogue scripts, and the like.
  • the method of full-text matching means that the input text is exactly the same as a part of the corresponding movie or music work
  • the method of keyword matching means that the input text is the same as a part of keywords of a movie or music work
  • The method of semantic similarity matching means that the input text is semantically similar to a part of a film or music work.
  • For example, suppose the input text is "He has always been the protagonist. He said that daydreaming is not wrong, and that people without dreams are just salted fish. On this road of fighting for dreams, as long as I work hard I will gain something; that is enough."
  • Through matching, this input text is found to correspond to matching content,
  • and the matching content is a line from "Shaolin Football".
  • Step 603 The terminal obtains an acoustic model corresponding to "character imitation" from an acoustic model library.
  • the terminal may select a certain acoustic model or a certain fusion model from the acoustic model library according to user preference.
  • In a possible implementation, the terminal determines a sound mode identifier related to the content of the current input voice according to that content, and selects from the acoustic model library the acoustic model corresponding to the sound mode identifier.
  • For example, the terminal may determine, according to the input text, user preference, or reply text, that the currently synthesized speech needs to use a "Chou Xingchi" type of voice, and then select the acoustic model of the "Chou Xingchi" voice from the acoustic model library.
  • In another possible implementation, the terminal determines a weight value (that is, a preference coefficient) for each acoustic model among a plurality of acoustic models, where the weight value of each acoustic model is set in advance by the user or determined in advance according to the user's preference; the acoustic models are then
  • fused based on these weight values to obtain a fusion acoustic model.
  • Step 604 The terminal performs subsequent speech synthesis by using the selected acoustic model.
  • For example, the speech the terminal would originally have synthesized is "Let's eat at XX tonight".
  • After the terminal applies the selected fusion model of the "Lin Zhiling" acoustic model and the "Little Shenyang" acoustic model, the final synthesized speech becomes, for example, "Do you know? Let's eat at XX place tonight, eh".
  • the speech features in the output speech use the relevant parameters of the "Lin Zhiling" acoustic model, thus reflecting the soft features of the synthesized speech.
  • the language style features in the output speech use the relevant parameters of the "Little Shenyang” acoustic model, thus reflecting the witty and funny characteristics of the synthesized speech.
  • The synthesized speech output in this way achieves the effect of "speaking in Xiao Shenyang's language style with Lin Zhiling's voice".
  • Further, a prosodic rhythm template similar to that shown in FIG. 23 may also be applied, which not only completes the real-time voice interaction with the user, but also meets the user's personalized needs and improves the user experience.
  • a background sound effect may also be superimposed when outputting the synthesized speech.
  • the scenario of superimposing "background sound" on the synthesized speech is used as an example to describe the speech synthesis method of the embodiment of the present invention. Referring to FIG. 29, the method can be described by the following steps:
  • Step 701 The terminal presets a music library.
  • a music library is preset in the TTS parameter library of the terminal, and the music library includes multiple music files, and these music files are used to provide a background sound effect during the speech synthesis process.
  • A background sound effect specifically refers to a piece of music (such as pure music or a song) or a sound effect (such as film and television sound effects, game sound effects, speech sound effects, animation sound effects, and so on).
  • Step 702 The terminal determines that the reply text has content suitable for superimposing background music.
  • the terminal may determine, through the DM module, content suitable for superimposing background music.
  • Content suitable for superimposing background music can be words with emotional polarity, poetry, film and television lines, and so on.
  • the terminal can identify the sentiment-oriented words in the sentence through the DM module, and then determine the emotional state of the phrase, sentence, or the entire reply text through methods such as grammatical rule analysis and machine learning classification.
  • the emotional dictionary can be used to identify these emotionally inclined words.
  • the emotional dictionary is a collection of words, and the words in the collection have obvious emotional polarity tendencies, and the emotional dictionary also contains the polarity information of these words.
  • For example, the words in the dictionary are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and so on.
  • Each emotional polarity can be further divided into multiple levels of emotional intensity (for example, five levels).
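  • A toy sketch of such an emotion dictionary and lookup; the entries, the intensity scale, and the "strongest hit wins" aggregation rule are invented for illustration:

```python
# Each dictionary entry carries an emotional polarity type and an intensity (1-5).
EMOTION_DICT = {
    "happy":  ("happiness", 3),
    "won":    ("happiness", 2),
    "boring": ("sadness",   2),
    "afraid": ("fear",      4),
}

def detect_emotion(words):
    """Return the dominant (polarity, intensity) among the emotion words found."""
    hits = [EMOTION_DICT[w] for w in words if w in EMOTION_DICT]
    if not hits:
        return ("neutral", 0)
    return max(hits, key=lambda h: h[1])

print(detect_emotion("the national football team won again so happy".split()))
```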
  • Step 703 The terminal determines a background sound effect to be superimposed from the music library.
  • the terminal determines the background sound effect to be superimposed in the TTS parameter database through the PM module.
  • Specifically, the terminal labels different segments (that is, sub-segments) of each music file in the music library with emotional polarity categories in advance; for example, these segments are labeled with emotional polarity types such as happiness, liking, sadness, surprise, anger, fear, and disgust. Assuming the current reply text includes text with emotional polarity, after determining the emotional polarity categories of that text in step 702, the terminal searches the music library through the PM module for a music file labeled with the corresponding emotional polarity category.
  • Further, if both the emotional polarity category and the emotional intensity are labeled for each sub-segment in the music library in advance, then after the emotional polarity category and emotional intensity of the text are determined in step 702, a combination of sub-segments carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
  • If the reply text contains poem/ci/qu content, the terminal searches the music library through the PM module for pure music, songs, or music effects related to that content; if found, the music or song is used as the background sound effect to be superimposed.
  • Alternatively, if the emotional polarity category is labeled for each background sound effect in the music library beforehand, then after the emotional polarity category of the poem/ci/qu content in the reply text is determined, the background sound effect labeled with the corresponding emotional polarity category can be found in the music library.
  • Similarly, if both an emotional polarity category and an emotional intensity label are set in advance for each background sound effect in the music library, then after the emotional polarity category and emotional intensity of the poem/ci/qu content are determined, the background sound effect carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library.
  • If the reply speech is to be synthesized with a character-imitation acoustic model, the terminal can also search the music library through the PM module for pure music, songs, or music effects related to the imitated character.
  • For example, if the imitated character corresponds to the "Little Shenyang" voice mode, a song clip can be selected from his songs, according to the dialogue scene or the content of the reply text, as the final background sound effect.
  • Step 704 The terminal aligns the reply text with the determined background sound effect to facilitate subsequent speech synthesis.
  • Specifically, the terminal can split the content of the reply text that needs background sound effects into different parts (split according to punctuation or word segmentation), each of which can be called a sub-content, and determine the emotional polarity type and emotional intensity of each sub-content. Then, after the background sound effect matching the content is determined, the content is aligned with the matched background sound effect, so that the emotional variation of the content is basically consistent with the emotional variation of the background sound effect.
  • the reply text is "The weather is good, the national football team has won again, so happy.”
  • the entire content of the reply text needs to be superimposed with background sound effects.
  • The reply text is split into three sub-contents: "The weather is good", "the national football team has won again", and "so happy". The emotional polarity category of each part is happiness, the emotional intensities are 0.48, 0.60, and 0.55 respectively (indicated by the black dots in the lower half of the figure), and the total pronunciation durations of the parts are 0.3s, 0.5s, and 0.2s respectively.
  • In step 703, a music file whose emotional polarity category is happiness has been preliminarily determined.
  • The emotional variation trajectory of the music file can then be computed to obtain the emotional intensity of each part of the music.
  • the waveform shown in Figure 30 represents a piece of music.
  • The music can be divided into 15 small segments, each with a duration of 0.1s. According to parameters such as the sound intensity and rhythm of each small segment, the emotional intensity of each segment is calculated through fixed rules or classifiers.
  • The emotional intensities of these 15 small segments are: 0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43, 0.52, 0.33, 0.45, 0.53, 0.44, 0.42, 0.41 (indicated by the black dots in the upper half of the figure).
  • Among them, a sub-segment composed of three consecutive small segments (for example, segments 4 to 6) has a total sound length of 0.3s,
  • and its maximum emotional intensity is 0.51 (from the emotional intensity 0.51 of the fourth segment);
  • the sub-segment composed of segments 7, 8, 9, 10, and 11 has a total sound length of 0.5s, and its maximum emotional intensity is 0.63 (from the emotional intensity 0.63 of the seventh segment);
  • a sub-segment composed of two further small segments has a total sound length of 0.2s, and its maximum emotional intensity is 0.53 (from the segment whose emotional intensity is 0.53).
  • The emotional variations of these three sub-segments are basically consistent with the emotional variations of the three sub-contents of the reply text (for example, the two polylines in the figure have essentially the same trajectory), so the music segment composed of the three sub-segments is the background sound effect that matches the reply text. The three sub-segments can therefore be aligned with "The weather is good", "the national football team has won again", and "so happy" in the reply text, so as to produce the "speech with superimposed background sound" effect in the subsequent speech synthesis.
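  • A possible sketch of this alignment, using the durations and intensity values from the example above; the greedy left-to-right window search is only one assumed realization, not necessarily how the embodiment selects sub-segments:

```python
seg_len = 0.1                                   # each music segment lasts 0.1 s
music_intensity = [0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43,
                   0.52, 0.33, 0.45, 0.53, 0.44, 0.42, 0.41]
sub_contents = [("The weather is good", 0.3, 0.48),
                ("the national football team has won again", 0.5, 0.60),
                ("so happy", 0.2, 0.55)]        # (text, duration s, emotional intensity)

def best_start(n_segs, target, not_before):
    """Among windows of n_segs segments starting at or after `not_before`, pick
    the one whose peak intensity is closest to the sub-content's intensity."""
    candidates = range(not_before, len(music_intensity) - n_segs + 1)
    return min(candidates,
               key=lambda s: abs(max(music_intensity[s:s + n_segs]) - target))

cursor, plan = 0, []
for text, dur, intensity in sub_contents:
    n = round(dur / seg_len)                    # number of music segments needed
    start = best_start(n, intensity, cursor)
    plan.append((text, start + 1, start + n))   # 1-based segment indices, inclusive
    cursor = start + n                          # keep the sub-segments in order
print(plan)  # e.g. aligns the three sub-contents with segments 4-6, 7-11, 12-13
```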
  • FIG. 31 is a schematic structural diagram of a speech synthesis device 200 according to an embodiment of the present invention.
  • the device 200 may include one or more processors 2011, one or more memories 2012, and an audio circuit 2013.
  • the device 200 may further include an input unit 2016, a display unit 2019, and other components.
  • the processor 2011 may be connected to the memory 2012, the audio circuit 2013, the input unit 2016, and the display unit 2019 through a bus, respectively. They are described as follows:
  • the processor 2011 is a control center of the device 200, and uses various interfaces and lines to connect various components of the device 200.
  • the processor 2011 may further include one or more processing cores.
  • The processor 2011 may perform speech synthesis by running or executing the software programs (instructions) and/or modules stored in the memory 2012 and calling the data stored in the memory 2012 (for example, executing the functions of the modules in the embodiments of FIG. 4 or FIG. 9 and processing their data), so as to enable real-time voice conversation between the device 200 and the user.
  • The memory 2012 may include a high-speed random access memory, and may further include a non-volatile memory such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 2012 may further include a memory controller to provide the processor 2011 and the input unit 2016 with access to the memory 2012.
  • the memory 2012 may be specifically used to store software programs (instructions) and data (relevant data in the acoustic model library, relevant data in the TTS parameter library).
  • the audio circuit 2013 may provide an audio interface between the device 200 and a user, and the audio circuit 2013 may further be connected with a speaker 2014 and a microphone 2015.
  • the microphone 2015 can collect the user's sound signals and convert them into electrical signals, which are received by the audio circuit 2013 and converted into audio data (that is, forming the user's input voice); the audio data is then transmitted to the processor 2011 for voice processing.
  • the processor 2011 synthesizes the reply voice based on the user's input voice and transmits it to the audio circuit 2013.
  • the audio circuit 2013 can convert the received audio data (that is, the reply voice) into an electrical signal, which is further transmitted to the speaker 2014 and converted by the speaker 2014 into a sound signal for output, so that the reply voice is presented to the user, thereby achieving real-time voice conversation between the device 200 and the user.
  • the input unit 2016 may be used to receive digital or character information input by a user, and generate a keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • Specifically, the input unit 2016 may include a touch-sensitive surface 2017 and other input devices 2018.
  • the touch-sensitive surface 2017 is also referred to as a touch display screen or a touchpad, which can collect user's touch operations on or near it and drive the corresponding connection device according to a preset program.
  • other input devices 2018 may include, but are not limited to, one or more of a physical keyboard, function keys, trackball, mouse, joystick, and the like.
  • the display unit 2019 can be used to display information input by the user or information provided by the device 200 to the user (such as identifiers or text related to the reply speech) and the various graphical user interfaces of the device 200; these graphical user interfaces can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 2019 may include a display panel 2020.
  • the display panel 2020 may be configured by using a liquid crystal display (Liquid Crystal Display, LCD), an organic light emitting diode (Organic Light-Emitting Diode, OLED), and the like.
  • Although the touch-sensitive surface 2017 and the display panel 2020 are shown as two separate components, in some embodiments the touch-sensitive surface 2017 and the display panel 2020 may be integrated to implement the input and output functions.
  • the touch-sensitive surface 2017 may cover the display panel 2020.
  • When the touch-sensitive surface 2017 detects a touch operation on or near it, the operation is transmitted to the processor 2011 to determine the type of touch event, and the processor 2011 then provides a corresponding visual output on the display panel 2020.
  • the device 200 in the embodiment of the present invention may include more or fewer components than shown, or some components may be combined, or different components may be arranged.
  • the device 200 may further include a communication module, a camera, and the like, and details are not described herein again.
  • the processor 2011 may implement the speech synthesis method of the embodiment of the present invention by running or executing a software program (instructions) stored in the memory 2012 and calling data stored in the memory 2012, including: determining the identity of the user according to the user's current input voice; obtaining an acoustic model from the acoustic model library according to the current input voice, where the preset information of the acoustic model includes two or more of a preset sound speed, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determining basic speech synthesis information from the speech synthesis parameter database according to the identity of the user, the basic speech synthesis information including the amount of change of one or more of the preset sound speed, the preset volume, and the preset pitch; determining a reply text based on the current input voice; determining enhanced speech synthesis information from the speech synthesis parameter database based on the reply text and context information, the enhanced speech synthesis information including the amount of change of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information, so as to obtain the voice used to reply to the user.
  • the memory 2012 may further be used to store these software modules, and the processor 2011 may run or execute the software programs (instructions) and/or these software modules in the memory 2012 and call the data stored in the memory 2012 to perform speech synthesis.
  • It should be noted that FIG. 31 shows only one implementation of the speech synthesis device of the present invention; in a possible embodiment, the processor 2011 and the memory 2012 in the device 200 may be deployed in an integrated manner.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the usable medium may be a magnetic medium (such as a floppy disk, a hard disk, a magnetic tape, etc.), an optical medium (such as a DVD, etc.), or a semiconductor medium (such as a solid state hard disk), and the like.

Abstract

A speech synthesis method and a related device. The method comprises: determining a user identity according to a current input speech of the user; acquiring an acoustic model from an acoustic model library (1033) according to the current input speech; determining basic speech synthesis information according to the user identity, the basic speech synthesis information representing variations in a preset sound speed, a preset volume, and preset pitch of the acoustic model; determining a reply text; determining enhanced speech synthesis information according to the reply text and contextual information, the enhanced speech synthesis information representing variations in a preset timbre, an intonation, and a preset rhythm of the acoustic model; and using the acoustic model to perform speech synthesis on the reply text according to the basic speech synthesis information and the enhanced speech synthesis information so as to acquire a speech used as a reply to the user. The method enables a device to provide a personalized speech synthesis effect for a user in a process of human-computer interaction, thereby improving the speech interaction experience of the user.

Description

Speech synthesis method and related device
Technical field
The present invention relates to the field of speech processing, and in particular, to a speech synthesis method and a related device.
Background
In recent years, human-computer dialogue has entered people's daily lives on a large scale; common scenarios include intelligent customer-service robots, smart speakers, and chatbots. The core of human-computer dialogue is that, within the framework of the built system and based on previously trained or learned data, the machine can automatically understand and analyze the speech input by the user and give a meaningful spoken reply. When designing a speech synthesis system for Chinese text, if the input characters are simply matched one by one against a pronunciation library and the pronunciations of all the characters are concatenated to form the speech output, the resulting speech sounds mechanical and stiff, without rise and fall of intonation, and the listening experience is poor. The TTS (text-to-speech) engine developed in recent years is a speech synthesis technology built on reading rules; using a TTS engine for speech synthesis handles the transitions between individual characters/words and shifts of tone relatively naturally, making the machine's spoken replies much closer to a human voice.
However, the prior art is limited to making the machine "sound like a human" during human-computer interaction, and does not consider users' diverse needs for human-computer interaction.
Summary of the invention
Embodiments of the present invention provide a speech synthesis method and a related device, so that during human-computer interaction a machine can provide a personalized speech synthesis effect for a user according to the user's preferences or the requirements of the dialogue environment, improving the timeliness of human-computer dialogue and enhancing the user's voice interaction experience.
In a first aspect, an embodiment of the present invention provides a speech synthesis method, which can be applied to a terminal device, including: the terminal device receives a user's current input speech and determines the user's identity according to the current input speech; obtains an acoustic model from an acoustic model library preset in the terminal device according to the current input speech, where preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; the terminal device determines basic speech synthesis information according to the user's identity, where the user's identity is associated with corresponding basic speech synthesis information; in the embodiments of the present invention, the basic speech synthesis information may also be called basic TTS parameters, and the basic TTS parameters are used to represent variations of one or more of the preset speech rate, the preset volume, and the preset pitch of the acoustic model used in speech synthesis; determines a reply text according to the current input speech; the terminal device determines enhanced speech synthesis information according to the reply text, or according to the reply text and context information; in the embodiments of the present invention, the enhanced speech synthesis information may also be called enhanced TTS parameters, and the enhanced TTS parameters are used to represent variations of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm of the acoustic model used in speech synthesis; in the embodiments of the present invention, the terminal device can determine the dialogue scenario of the current dialogue according to the reply text, or according to the reply text and the context information of the current input speech; the terminal device then performs speech synthesis on the reply text through the acoustic model (including the preset information of the acoustic model) according to the basic speech synthesis information and the enhanced speech synthesis information, obtaining the reply speech to be presented to the user, thereby realizing real-time dialogue interaction between the terminal device and the user. That is, in the embodiments of the present invention, the acoustic model can convert the reply text into reply speech according to the preset information of the acoustic model and the variation information of the preset information.
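For illustration only, the following minimal sketch outlines this flow; the dictionaries, function name, and parameter names are assumptions rather than the claimed implementation, and the actual identity recognition, dialogue management, and neural acoustic model are omitted.

```python
# Minimal, self-contained sketch of the described flow (illustrative only).
BASIC_TTS = {"alice": {"rate": +0.1, "volume": 0.0, "pitch": -0.05}}          # per-user deltas
ENHANCED_TTS = {"poetry": {"timbre": 0.0, "intonation": +0.2, "rhythm": +0.3}}  # per-scene deltas

def synthesize_reply(user_id: str, reply_text: str, scene: str) -> dict:
    basic = BASIC_TTS.get(user_id, {"rate": 0.0, "volume": 0.0, "pitch": 0.0})
    enhanced = ENHANCED_TTS.get(scene, {"timbre": 0.0, "intonation": 0.0, "rhythm": 0.0})
    # A real system would feed these deltas, together with the acoustic model's
    # preset values, into the neural acoustic model; here we only return them.
    return {"text": reply_text, "basic": basic, "enhanced": enhanced}

print(synthesize_reply("alice", "Nice to meet you", "poetry"))
```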
Optionally, the acoustic model library may include multiple acoustic models (for example, a general-purpose acoustic model, personalized acoustic models, and so on). These acoustic models are all neural network models, which may be trained in advance on different corpora. Each acoustic model has its own preset information; in other words, each acoustic model is bound to a specific piece of preset information, which can serve as basic input information for that acoustic model.
Optionally, because a user's identity can also be associated with the user's personal preferences, the terminal may also determine the basic speech synthesis information according to the user's personal preferences.
In the embodiments of the present invention, the context information may represent the context of the current input speech or the historical input speech preceding the current input speech.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, in the human-computer voice interaction between a user and a terminal device, the terminal device on the one hand generates a corresponding reply text according to the user's input speech, and on the other hand can select personalized TTS parameters (including basic TTS parameters and enhanced TTS parameters) based on the reply text of the dialogue interaction and the dialogue context information, in combination with the current user's identity, preferences, and dialogue scenario. The terminal device can then use these personalized TTS parameters and the selected acoustic model to generate reply speech in a specific style, thereby presenting a personalized speech synthesis effect to the user, greatly improving the user's voice interaction experience with the terminal, and improving the timeliness of human-computer dialogue.
Based on the first aspect, in a possible implementation, the terminal device also allows the user to tune the terminal device in real time through speech and update the TTS parameters associated with the user's identity and preferences, including updating the basic TTS parameters and the enhanced TTS parameters, so that the tuned terminal is closer to the user's interaction preferences and the user's interaction experience is maximized.
Based on the first aspect, in a possible implementation, the enhanced TTS parameters may be further classified into speech emotion parameters, speech scene parameters, and the like. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics; depending on the emotional characteristics, the speech emotion parameters may be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness. The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics; depending on the scene characteristics, the speech scene parameters may be further divided into parameters such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcasting. In other words, using these speech scene parameters in speech synthesis enables the synthesized speech to present the sound effects of voice scenes such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcasting.
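As a purely illustrative sketch of how such emotion and scene parameters might be organized, all names and values below are assumptions and are not taken from the embodiments.

```python
# Illustrative grouping of enhanced TTS parameters into emotion and scene parameters.
EMOTION_PARAMS = {
    "neutral":      {"pitch_shift": 0.00, "rate_shift": 0.00},
    "mildly_happy": {"pitch_shift": +0.05, "rate_shift": +0.05},
    "very_happy":   {"pitch_shift": +0.15, "rate_shift": +0.10},
    "mildly_sad":   {"pitch_shift": -0.05, "rate_shift": -0.05},
}
SCENE_PARAMS = {
    "daily_dialogue": {"rhythm_boost": 0.0},
    "poetry_recital": {"rhythm_boost": 0.4},   # stronger pauses and stress
    "song_humming":   {"rhythm_boost": 0.6},
    "storytelling":   {"rhythm_boost": 0.2},
    "news_broadcast": {"rhythm_boost": 0.1},
}
print(EMOTION_PARAMS["mildly_happy"], SCENE_PARAMS["poetry_recital"])
```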
The following takes "poetry recitation" as an example to describe an implementation in which speech scene parameters related to "poetry recitation" are used in speech synthesis.
In the embodiments of the present invention, the manner of determining that the current dialogue is a "poetry recitation" voice scene may include:
(1) During the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "poetry recitation" voice scene.
(2) In an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "poetry recitation", the terminal device can still determine whether the content of the reply text involves one or more specific literary styles such as shi poems, ci poems, qu verses, or fu prose, for example five-character or seven-character quatrains, regulated verse, or a specific ci or qu tune pattern.
(3) The terminal device stores in advance literary-style features, such as the number of characters, the number of sentences, and the order of the number of characters per sentence, corresponding to various literary styles (or syntactic formats). By analyzing features of the reply text such as punctuation (pauses), the number of characters, the number of sentences, and the order of the number of characters per sentence, it matches a passage or the whole of the reply text against the pre-stored literary-style features; if the match succeeds, the passage or the whole text that fits the pre-stored literary-style features can be taken as the text for which the "poetry recitation" voice scene is used (a simple matching sketch follows this list).
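A minimal sketch of detection method (3), assuming the literary-style features are stored as per-line character counts; the pattern names and contents are illustrative only.

```python
import re

# Match punctuation-delimited clause lengths against stored literary-style patterns.
STYLE_PATTERNS = {
    "five_char_quatrain":  [5, 5, 5, 5],
    "seven_char_quatrain": [7, 7, 7, 7],
}

def detect_literary_style(text: str):
    clauses = [s.strip() for s in re.split(r"[，。！？,.!?\n]", text) if s.strip()]
    lengths = [len(s) for s in clauses]
    for name, pattern in STYLE_PATTERNS.items():
        if lengths == pattern:
            return name
    return None

print(detect_literary_style("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
# -> five_char_quatrain
```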
In the embodiments of the present invention, the "poetry recitation" voice scene emphasizes the prosodic rhythm of the speech. The speech scene parameters for "poetry recitation" are used to adjust, for input text that conforms to a specific literary style (or syntactic format), the positions and durations of pauses in the speech (i.e., the segmentation of the text content), the reading duration of individual characters or words, and the positions of stress, so as to strengthen the prosodic rhythm. Compared with the natural prosodic rhythm of an ordinary dialogue, the strengthened prosodic rhythm expresses emotion more clearly and strongly; for example, when reciting texts with specific syntactic formats such as poems or the parallel lines of nursery rhymes, the strengthened prosodic rhythm can produce a cadenced, rising-and-falling feeling.
In a specific implementation, the speech scene parameters for "poetry recitation" may be realized by prosodic rhythm templates; the text content of each specific literary style may correspond to one prosodic rhythm template. The literary style characterizes the genre of the poem or verse, for example ancient-style poetry, modern-style poetry (such as five-character or seven-character quatrains), regulated verse (such as five-character or seven-character regulated verse), ci poems (such as short, medium, and long ci), and qu verses (including various tunes and tune patterns). Each prosodic rhythm template defines the volume variation of the character at each position in the template (i.e., how heavily the character is stressed), the variation of its duration (i.e., how long the character is pronounced), the positions and durations of pauses in the speech of the text (i.e., the segmentation of the text content), and so on.
Specifically, in a possible implementation, when the terminal determines, according to the reply text and the context information, that the current dialogue is in the "poetry recitation" voice scene, the process in which the terminal determines the enhanced speech synthesis information according to the reply text and the context information specifically includes: determining literary-style features of the reply text by analyzing the reply text, where the literary-style features include one or more of the number of sentences, the number of characters per sentence, and the arrangement order of the numbers of characters per sentence for part or all of the content of the reply text; and selecting, according to the literary-style features involved in the reply text, a corresponding variation of the preset prosodic rhythm. The variation of the preset prosodic rhythm is the prosodic rhythm template, and there is a correspondence between the literary-style features and the prosodic rhythm templates.
In the "poetry recitation" voice scene of the specific embodiments of the present invention, the terminal aligns the content of the reply text with the prosodic rhythm template so as to facilitate subsequent speech synthesis. Specifically, when speech synthesis is required, the terminal may align the relevant content of the reply text with the prosodic rhythm template of the "poetry recitation" voice scene. Specifically, the terminal may combine the pronunciations of the relevant content of the reply text in the acoustic model library with the parameters of the prosodic rhythm template, and superimpose the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
For example, in an exemplary embodiment, the prosody enhancement parameter is ρ (0 < ρ < 1), and the preset volume of the i-th character in the text content is Vi. If the prosodic rhythm features of this character include a stress feature whose stress variation is E1, the final volume of the character is Vi × (1 + E1) × (1 + ρ). For another example, if the basic duration of the i-th character in the text is Di and the variation of the duration is E2, the final duration of the character is Di × (1 + E2). For yet another example, a pause is required between the i-th character and the (i + 1)-th character, and the pause duration changes from 0 s to 0.02 s.
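These example formulas can be transcribed directly; the following sketch only reproduces the arithmetic above with illustrative values.

```python
# Worked example of the prosody adjustments described above (values illustrative).
def stressed_volume(preset_volume: float, stress_delta: float, rho: float) -> float:
    # Final volume of a stressed character: Vi * (1 + E1) * (1 + rho), with 0 < rho < 1.
    return preset_volume * (1 + stress_delta) * (1 + rho)

def adjusted_duration(preset_duration: float, duration_delta: float) -> float:
    # Final duration of a character: Di * (1 + E2).
    return preset_duration * (1 + duration_delta)

print(stressed_volume(0.8, 0.2, 0.3))   # Vi=0.8, E1=0.2, rho=0.3 -> about 1.248
print(adjusted_duration(0.25, 0.1))     # Di=0.25 s, E2=0.1 -> about 0.275 s
```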
Based on the first aspect, in a possible implementation, the acoustic model library may include a general-purpose acoustic model and several personalized acoustic models, where:
The preset information of the general-purpose acoustic model may include the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and so on; speech synthesized by the general-purpose acoustic model presents the sound effect of a normal, general dialogue scenario.
The preset information of a personalized acoustic model may include voice features and language style features. That is, in addition to two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, the preset information of a personalized acoustic model may also include other personalized information, for example one or more language style features such as catchphrases, ways of responding to particular scenarios, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for particular people. Speech synthesized by a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario.
It should be understood that the preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information of different acoustic models also differ from one another; for example, the preset information of a personalized acoustic model may differ noticeably from that of the general-purpose acoustic model.
The following takes "character imitation" as an example to describe an implementation in which an acoustic model related to "character imitation" is used in speech synthesis.
In the embodiments of the present invention, the terminal device may determine from the user's input speech that the current dialogue needs to use a "character imitation" acoustic model, which specifically includes several ways:
(1) During the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "character imitation" scenario; after determining the user intent, the terminal device determines that the current dialogue is a "character imitation" scenario. For example, if the user's input speech instructs the terminal to speak with Lin Zhiling's voice, then after recognizing the user intent, the terminal automatically sets the current dialogue scenario to "character imitation".
(2) In an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "character imitation", the terminal device can still determine whether the content of the input text corresponding to the user's input speech involves content suitable for character imitation. In a specific implementation, reply content suitable for character imitation can be identified by means of full-text matching, keyword matching, semantic similarity matching, and the like; such content includes lyrics, sound effects, movie lines, cartoon dialogue scripts, and so on.
In the specific embodiments of the present invention, the acoustic model library of the terminal device is preset with various acoustic models (i.e., personalized acoustic models) for implementing "character imitation". A "character imitation" acoustic model can be used to give the synthesized speech the voice characteristics of a particular person, so information such as the preset timbre, preset intonation, and preset prosodic rhythm of a "character imitation" acoustic model differs from that of the general-purpose acoustic model. The person imitated by a "character imitation" acoustic model may be a figure the user likes, a character in a film or television work, or a combination of several preset voice models and the user's preferences. For example, a "character imitation" acoustic model may be an acoustic model that imitates the user's own speaking style, or an acoustic model that imitates the speaking characteristics of another person, for example an acoustic model imitating "Lin Zhiling / gentle voice", an acoustic model imitating "Xiao Shenyang / funny voice", an acoustic model imitating "Andy Lau / deep voice", and so on. In addition, in a possible embodiment, what the terminal selects in the speech synthesis process is not a specific acoustic model in the acoustic model library but a composite of multiple acoustic models in the acoustic model library (also called a fusion model).
The ways in which the terminal obtains from the acoustic model library the acoustic model corresponding to "character imitation" may include the following:
(1) The terminal device may select an acoustic model or a fusion model from the acoustic model library according to the user's identity. Specifically, because the user's identity can be associated with the user's preferences, the terminal device may determine the user's preferences according to the user's identity, and then select an acoustic model or a fusion model from the acoustic model library according to the user's preferences, for example the user's favorite acoustic model imitating "Lin Zhiling / gentle voice", the acoustic model imitating "Xiao Shenyang / funny voice", the acoustic model imitating "Andy Lau / deep voice", a preset fusion model, and so on.
It should be noted that the acoustic model preferred by the user is not necessarily a personalized acoustic model originally provided in the acoustic model library; it may be an acoustic model obtained by fine-tuning the parameters of a personalized acoustic model according to the user's preferences. For example, the voice features of a personalized acoustic model originally provided in the acoustic model library include a first speech rate, a first intonation, a first prosodic rhythm, and a first timbre. Through analysis of the user's preferences or through the user's manual settings, the terminal determines that the user's favorite combination of parameters is 0.8 times the first speech rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm, and 1.2 times the first feminine timbre, and adjusts these parameters accordingly, thereby obtaining a personalized acoustic model that meets the user's needs.
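A minimal sketch of this fine-tuning step, assuming the preset features and the user-preferred factors are stored as simple scalars; all names and values are illustrative.

```python
# Scale a personalized model's preset features by user-preferred factors.
base_model   = {"rate": 1.0, "intonation": 1.0, "rhythm": 1.0, "timbre_femininity": 1.0}
user_factors = {"rate": 0.8, "intonation": 1.3, "rhythm": 0.9, "timbre_femininity": 1.2}

tuned_model = {feature: base_model[feature] * user_factors[feature] for feature in base_model}
print(tuned_model)  # {'rate': 0.8, 'intonation': 1.3, 'rhythm': 0.9, 'timbre_femininity': 1.2}
```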
(2) The terminal device determines, according to the content of the current input speech, a voice model identifier related to the content of the current input speech, and selects from the acoustic model library the acoustic model corresponding to the voice model identifier. For example, the terminal may determine, according to the input text, the user's preferences, or the reply text, that the currently synthesized speech needs to use a "Zhou Xingchi (Stephen Chow)"-style voice, and then selects the "Zhou Xingchi" voice-type acoustic model from the acoustic model library.
(3) After selecting multiple acoustic models from the acoustic model library according to the user's identity, the terminal device determines a weight value (i.e., a preference coefficient) of each of the multiple acoustic models, where the weight values of the acoustic models are preset by the user, or the weight values of the acoustic models are determined in advance according to the user's preferences; the acoustic models are then fused based on the weight values to obtain a fused acoustic model.
For example, after the terminal device has obtained the user's preferences or requirements for voices, it may also match them directly against the voices of multiple acoustic models according to the user's identity (i.e., the user's preferences or requirements are bound directly to the user's identity), and thereby determine that the user's preference coefficients for voice types such as deep, gentle, cute, and funny are 0.2, 0.8, and 0.5 respectively; that is, the weights of these acoustic models are 0.2, 0.8, and 0.5 respectively. The final acoustic model (i.e., the fusion model) can be obtained by weighted superposition of the speech rate, intonation, prosodic rhythm, timbre, and so on of each voice type. The speech scene parameters synthesized in this way realize voice conversion of the acoustic model in terms of speech rate, intonation, prosodic rhythm, and timbre, which helps produce mixed sound effects such as "a wittily speaking Lin Zhiling" or "rap-mode Lin Zhiling".
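A sketch of such weighted superposition, assuming each model's preset features are scalars and that the weighted sum is normalized by the total weight; the normalization and all parameter values are assumptions, not stated in the embodiment.

```python
# Fuse several acoustic models' preset parameters using preference coefficients.
models = {
    "deep":  {"rate": 0.9, "intonation": 0.8, "rhythm": 1.0, "timbre": 0.2},
    "soft":  {"rate": 1.0, "intonation": 1.2, "rhythm": 1.1, "timbre": 0.8},
    "funny": {"rate": 1.2, "intonation": 1.3, "rhythm": 1.3, "timbre": 0.5},
}
weights = {"deep": 0.2, "soft": 0.8, "funny": 0.5}   # preference coefficients from the example

total = sum(weights.values())
fused = {
    feature: sum(weights[name] * params[feature] for name, params in models.items()) / total
    for feature in next(iter(models.values()))
}
print(fused)  # weighted blend of speech rate, intonation, rhythm, and timbre
```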
Based on the first aspect, in a possible implementation, the TTS parameters further include correspondences between target characters and the user's preferred pronunciations. A customized character pronunciation table includes mapping relationships between target characters and the user's preferred pronunciations. The mapping relationship between a target character and the user's preferred pronunciation is used to make the target character involved in the speech synthesized by the acoustic model have the pronunciation preferred by the user. The mapping relationships between target characters and the user's preferred pronunciations are associated with the user's identity; that is, different mapping relationships can be organized according to the user's identity.
In the embodiments of the present invention, the customized character pronunciation table may be organized and stored by user identity: the customized character pronunciation table corresponding to an unregistered user is empty, while the customized character pronunciation table corresponding to a registered user supports operations such as addition, modification, and deletion based on that user's preferences. The objects of such setting operations may be characters, personal or place names, letters, special symbols, and so on that the terminal tends to misread or that the user likes. The customized character pronunciation table includes mapping relationships between target characters (or strings) and the user's preferred pronunciations; a target character (string) may be a character (Chinese or foreign), a word, a phrase, or a sentence, and may also be a digit or a symbol (such as a Chinese character, a foreign character, an emoticon, a punctuation mark, a special symbol, and so on).
Specifically, the terminal device may determine in advance, according to the user's historical input speech, the correspondence between a target character and the user's preferred pronunciation, associate the correspondence between the target character and the user's preferred pronunciation with the user's identity, and write it into the customized character pronunciation table.
For example, the terminal's original acoustic model generates the pronunciation "xiao3 zhu1 pei4 qi2" for "小猪佩奇" (Peppa Pig). If the user has previously tuned the terminal device by voice and requested that the pronunciation of "奇" in the phrase "小猪佩奇" be set to "ki1", the terminal device records "小猪佩奇" and "xiao3 zhu1 pei4 ki1" as a mapping relationship and writes this mapping relationship into the customized character pronunciation table associated with "xiaoming".
For another example, the terminal device may find, in the context information, the dialogue text output by the terminal in the previous round or previous several rounds of dialogue, and determine the pronunciation of each word in that dialogue text (for example, using the acoustic model). For example, the text output by the terminal in the previous round of dialogue was "很高兴认识你，小茜" ("Nice to meet you, Xiao Qian"), and the terminal determines that its corresponding pronunciation is "hen3 gao1 xing4 ren4 shi2 ni3, xiao3 xi1". By matching the misread pronunciation against the pronunciation string of that output text, the DM module can determine that the Chinese word corresponding to the misread pronunciation "xiao3 xi1" is "小茜"; that is, "小茜" is the target word (the target character to be corrected). The terminal device then adds the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target character-pronunciation pair to the customized character pronunciation table associated with the current user identity.
In this way, in the speech synthesis of the current dialogue, when the terminal device finds that the target character associated with the user's identity exists in the reply text, it performs speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information. For example, in the current real-time human-computer dialogue, when the terminal device's reply text contains "小茜", the terminal device determines, according to the record in the customized character pronunciation table, that the pronunciation of "小茜" is "xiao3 qian4". Thus, in the reply speech obtained by speech synthesis through the acoustic model, "小茜" is pronounced "xiao3 qian4".
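A minimal sketch of such a per-user customized character pronunciation table and its lookup before synthesis, reusing the pinyin readings from the examples above; the table structure and function name are assumptions.

```python
# Per-user mapping from target strings to preferred pinyin readings.
PRONUNCIATION_TABLE = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1", "小茜": "xiao3 qian4"},
}

def pronunciation_overrides(user_id: str, reply_text: str) -> dict:
    # Collect the target strings that occur in the reply text together with the
    # user's preferred readings; the synthesizer would apply these as overrides.
    table = PRONUNCIATION_TABLE.get(user_id, {})
    return {target: reading for target, reading in table.items() if target in reply_text}

print(pronunciation_overrides("xiaoming", "很高兴认识你，小茜"))  # {'小茜': 'xiao3 qian4'}
```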
Based on the first aspect, in a possible implementation, the TTS parameters further include background sound effects; that is, the TTS parameter library may contain a music library, and the music library includes multiple pieces of music information used to provide background sound effects during speech synthesis. A background sound effect specifically refers to a segment of music (such as an instrumental piece or a song) or a sound effect (such as film and television sound effects, game sound effects, language sound effects, or animation sound effects). The background sound effect is used to superimpose music or sound effects of different styles and rhythms onto the background of the speech synthesized by the acoustic model, thereby enhancing the expressive effect of the synthesized speech (for example, enhancing its emotional effect).
The following takes the scenario of superimposing a "background sound effect" onto the synthesized speech as an example to describe the speech synthesis method of the embodiments of the present invention.
In the embodiments of the present invention, a background sound effect needs to be superimposed onto the synthesized speech only when the terminal device determines that the reply text contains content suitable for superimposing background music. Specifically, the terminal device may automatically identify content suitable for superimposing background music; such content may be text with emotional polarity, poetry or song lyrics, film and television lines, and so on. For example, the terminal may use the DM module to identify emotionally inclined words in a sentence, and then determine the emotional state of a phrase or sentence in the reply text, or of the entire reply text, through methods such as grammatical rule analysis and machine learning classification. In this process, an emotion dictionary may be used to identify these emotionally inclined words. The emotion dictionary is a set of words, each of which has a clear emotional polarity tendency, and the emotion dictionary also contains the polarity information of these words; for example, the words in the dictionary are labeled with emotional polarity types such as happy, like, sadness, surprise, angry, fear, and disgust. In possible embodiments, different emotional polarity types may even be further divided into multiple levels of emotional intensity (for example, five levels).
After determining that the reply text contains content suitable for superimposing a background sound effect, the terminal determines from the music library the background sound effect to be superimposed. Specifically, the terminal sets in advance emotional polarity category labels for the different segments (i.e., sub-segments) of each music file in the music library; for example, these segments are labeled with emotional polarity types such as happy, like, sadness, surprise, angry, fear, and disgust. Assuming that the current reply text includes text with emotional polarity, after determining the emotional polarity categories of this text, the terminal device searches the music library for music files carrying the corresponding emotional polarity category labels. In a possible embodiment, if the emotional polarity types can be further divided into multiple levels of emotional intensity, emotional polarity category labels and emotional intensity labels are set in advance for each sub-segment in the music library; then, after the emotional polarity categories and emotional intensities of this text are determined, a combination of sub-segments carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
The following describes the process in which the terminal device selects, according to part or all of the content of the reply text, the best-matching background sound effect from the preset music library. The terminal device may split the content of the reply text onto which a background sound effect is to be superimposed into different parts (split by punctuation or by word segmentation); each part may be called a sub-content, and the emotional polarity type and emotional intensity of each sub-content are computed. Then, after the background sound effect that best matches the content is determined in the music library, the content is aligned with the matched background sound effect, so that the emotional variation of the content is basically consistent with the emotional variation of the background sound effect. Specifically, the best-matching background sound effect includes multiple sub-segments, each carrying an emotional polarity type label and an emotional intensity label; the emotional polarity types indicated by the labels of the sub-segments are respectively the same as the emotional polarity types of the corresponding sub-contents, and the variation trend of the emotional intensities indicated by the labels of the sub-segments is consistent with the variation trend of the emotional intensities of the sub-contents.
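A toy sketch of this splitting and per-sub-content emotion classification, assuming a tiny emotion dictionary with polarity labels and intensity values; the dictionary contents and the intensity scale are assumptions.

```python
import re

# Toy emotion dictionary: word -> (polarity, intensity).
EMOTION_DICT = {"开心": ("happy", 3), "赢": ("happy", 2), "不错": ("happy", 1),
                "难过": ("sadness", 2), "害怕": ("fear", 2)}

def classify_sub_contents(reply_text: str):
    results = []
    for sub in (s for s in re.split(r"[，。！？,.!?]", reply_text) if s.strip()):
        hits = [label for word, label in EMOTION_DICT.items() if word in sub]
        polarity = hits[0][0] if hits else "neutral"
        intensity = max((h[1] for h in hits), default=0)
        results.append((sub, polarity, intensity))
    return results

print(classify_sub_contents("天气不错，国足又赢球了，好开心"))
# -> [('天气不错', 'happy', 1), ('国足又赢球了', 'happy', 2), ('好开心', 'happy', 3)]
```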
For example, in one application scenario, the reply text is "天气不错，国足又赢球了，好开心" ("The weather is nice, the national football team won again, I'm so happy"), and the entire content of the reply text needs a background sound effect. The reply text is split into three sub-contents, "天气不错," "国足又赢球了," and "好开心"; the emotional polarity category of each sub-content is happy, and they have different emotional intensities. A music file whose emotional polarity category is happy is first identified in the music library; further, the emotional variation trajectory of the music file can be computed and analyzed to obtain the emotional intensities of three sub-segments of the music. The emotional variation of these three sub-segments is basically consistent with the emotional variation trend of the three sub-contents of the reply text, so the music clip composed of these three sub-segments of this music file is the background sound effect that matches the reply text. The sub-contents "天气不错," "国足又赢球了," and "好开心" of the reply text can therefore be aligned with these three sub-segments respectively. In the subsequent speech synthesis, the terminal device performs speech synthesis on the reply text through the selected acoustic model according to the background sound effect (i.e., the best-matching music clip), the basic speech synthesis information, and the enhanced speech synthesis information, and the final reply speech that is output presents the effect of "speech overlaid with a background sound effect".
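A sketch of how the matching condition described above might be checked, assuming each text sub-content and each labelled music sub-segment is represented as a (polarity, intensity) pair, and that "consistent variation trend" means the same sign of change between neighbouring items; that interpretation is an assumption.

```python
# Check that a candidate music clip matches the text's polarity and intensity trend.
def trend(values):
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]  # +1, 0, or -1 per step

def matches(sub_contents, music_segments):
    if len(sub_contents) != len(music_segments):
        return False
    same_polarity = all(c[0] == m[0] for c, m in zip(sub_contents, music_segments))
    same_trend = trend([c[1] for c in sub_contents]) == trend([m[1] for m in music_segments])
    return same_polarity and same_trend

text_parts = [("happy", 1), ("happy", 2), ("happy", 3)]   # from the reply text
clip       = [("happy", 2), ("happy", 4), ("happy", 5)]   # labelled music sub-segments
print(matches(text_parts, clip))   # True: same polarity, same rising intensity trend
```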
Based on the first aspect, in a possible implementation, the current dialogue scenario may also be a "nursery rhyme humming" voice scene; in this scenario, the enhanced speech synthesis information used by the terminal device in speech synthesis includes the speech scene parameters for "nursery rhyme humming".
The following takes the voice scene of "song humming" (with nursery rhyme humming as an example) to describe the speech synthesis method of the embodiments of the present invention.
In music, time is divided into equal basic units, and each basic unit is called a "beat". The duration of a beat is expressed in terms of note values: one beat may be a quarter note (i.e., a quarter note counts as one beat), a half note (a half note counts as one beat), or an eighth note (an eighth note counts as one beat). The rhythm of music is generally defined by its meter, for example 4/4: in 4/4 time, a quarter note is one beat and there are four beats per bar, i.e., four quarter notes. The so-called speech scene parameters for "nursery rhyme humming" are the preset beat types of various nursery rhymes and the way in which the content of a reply text that needs to be synthesized in the "nursery rhyme humming" manner is segmented.
In the embodiments of the present invention, the terminal determines, from the reply text and the context information, that the voice scene of the current dialogue is the "nursery rhyme humming" voice scene.
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "nursery rhyme humming" voice scene.
Another way is that, in an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "nursery rhyme humming", the terminal can still use the DM module to determine whether the content of the reply text involves the content of a nursery rhyme. In a specific implementation, the DM module may search a locally pre-stored nursery rhyme library or a nursery rhyme library on a network server through methods such as text search matching or semantic analysis; the nursery rhyme library may contain the lyrics of a wide variety of nursery rhymes. The DM module then determines whether the content of the reply text exists in these nursery rhyme lyrics, and if so, sets the current dialogue scenario to the "nursery rhyme humming" voice scene.
In the embodiments of the present invention, the terminal device may align the content of the reply text with the beats to facilitate subsequent speech synthesis. Specifically, in a specific embodiment, the terminal may use the PM module to align the content of the reply text with the determined beats, so as to ensure that each field of the text is fused with the rhythmic pattern of the nursery rhyme. Specifically, the terminal aligns the segmented text fields with the time axis according to the pattern of the beats.
For example, if a field in the reply text has 3 characters and the matched meter is 3/3 or 3/4, the 3 characters can be aligned one-to-one with the 3 beats of a bar.
For another example, if the number of characters in a field of the reply text is smaller than the number of beats in a bar, for example the field has 2 characters and the meter is 4/4, the terminal searches the text fields adjacent to this field; if the field before it (or the field after it) also has 2 characters, this field and the adjacent field can be merged and together aligned with the 4 beats of the bar. If the adjacent fields cannot be merged, or the number of characters after merging is still smaller than the number of beats, beat alignment can further be performed in the following ways: one way is to fill the beats that have no character with silence; another way is to align the rhythm by lengthening the duration of a particular character; yet another way is to lengthen the duration of every character evenly to ensure overall time alignment.
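A minimal sketch of this beat alignment, assuming one character per beat where possible, even stretching when the beat count is a multiple of the character count, and rests otherwise; the policy details are assumptions and the merging of adjacent fields is omitted.

```python
# Align a text field to a fixed number of beats per bar (simplified policy).
def align_to_bar(field: str, beats_per_bar: int):
    chars = list(field)
    if len(chars) >= beats_per_bar:
        return chars[:beats_per_bar]                       # one character per beat
    if beats_per_bar % len(chars) == 0:
        stretch = beats_per_bar // len(chars)              # lengthen each character evenly
        return [c + "~" * (stretch - 1) for c in chars]    # "~" marks a held syllable
    return chars + [""] * (beats_per_bar - len(chars))     # fill remaining beats with rests

print(align_to_bar("小白兔", 3))   # ['小', '白', '兔']        one character per beat
print(align_to_bar("跳跳", 4))     # ['跳~', '跳~']            each character held for two beats
print(align_to_bar("真可爱", 4))   # ['真', '可', '爱', '']    last beat filled with a rest
```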
In a second aspect, an embodiment of the present invention provides a speech synthesis device, where the device includes a processor and a memory coupled to the processor, where:
the memory is configured to store an acoustic model library and a speech synthesis parameter library (which may be referred to as a TTS parameter library for short), where the acoustic model library stores one or more acoustic models, and the speech synthesis parameter library stores basic speech synthesis information associated with the user's identity, as well as enhanced speech synthesis information;
the processor is configured to: determine the user's identity according to the user's current input speech; obtain an acoustic model from the acoustic model library according to the current input speech, where preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determine basic speech synthesis information from the speech synthesis parameter library according to the user's identity, where the basic speech synthesis information includes variations of one or more of the preset speech rate, the preset volume, and the preset pitch; determine a reply text according to the current input speech; determine enhanced speech synthesis information from the speech synthesis parameter library according to the reply text and the context information of the current input speech, where the enhanced speech synthesis information includes variations of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the processor is specifically configured to: determine literary-style features of the reply text according to the reply text, where the literary-style features include one or more of the number of sentences, the number of characters per sentence, and the arrangement order of the numbers of characters per sentence for part or all of the content of the reply text; and select, according to the literary-style features involved in the reply text, a corresponding variation of the preset prosodic rhythm from the speech synthesis parameter library, where there is a correspondence between the literary-style features and the variations of the preset prosodic rhythm, and the variation of the preset prosodic rhythm represents the respective variations of the reading duration of characters, the positions of reading pauses, the durations of reading pauses, and the stress in part or all of the content of the reply text.
Based on the second aspect, in a possible embodiment, the preset information of the selected acoustic model further includes language style features, where the language style features specifically include one or more of catchphrases, ways of responding to particular scenarios, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for particular people.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and the processor is specifically configured to: determine the user's preferences according to the user's identity, and select an acoustic model from the acoustic model library according to the user's preferences.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and each acoustic model has a voice model identifier; the processor is specifically configured to: determine, according to the content of the current input speech, a voice model identifier related to the content of the current input speech, and select from the acoustic model library the acoustic model corresponding to the voice model identifier.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and the processor is specifically configured to: select multiple of the acoustic models according to the user's identity; determine a weight value of each of the multiple acoustic models, where the weight values of the acoustic models are preset by the user, or the weight values of the acoustic models are determined in advance according to the user's preferences; and fuse the acoustic models based on the weight values to obtain a fused acoustic model.
Based on the second aspect, in a possible embodiment, the processor is further configured to: before determining the user's identity according to the user's current input speech, determine, according to the user's historical input speech, a correspondence between a target character and the user's preferred pronunciation, associate the correspondence between the target character and the user's preferred pronunciation with the user's identity, and save the correspondence between the target character and the user's preferred pronunciation into the speech synthesis parameter library; the processor is further specifically configured to: when the target character associated with the user's identity exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the speech synthesis parameter library further stores a music library; the processor is further configured to select a background sound effect from the music library according to the reply text, where the background sound effect is music or a sound effect; the processor is further specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the background sound effect carries one or more emotional polarity type labels and emotional intensity labels; the emotional polarity type label is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; the emotional intensity label is used to indicate the respective degree value of the at least one emotion; the processor is specifically configured to: split the content of the reply text into multiple sub-contents and determine the emotional polarity type and emotional intensity of each sub-content; and select the best-matching background sound effect from the music library according to the emotional polarity type and emotional intensity of each sub-content, where the best-matching background sound effect includes multiple sub-segments, each carrying an emotional polarity type label and an emotional intensity label, the emotional polarity types indicated by the labels of the sub-segments are respectively the same as the emotional polarity types of the corresponding sub-contents, and the variation trend of the emotional intensities indicated by the labels of the sub-segments is consistent with the variation trend of the emotional intensities of the sub-contents.
Based on the second aspect, in a possible embodiment, the device may further include an audio circuit, where the audio circuit can provide an audio interface between the device and the user, and the audio circuit may further be connected to a loudspeaker and a microphone. On the one hand, the microphone can collect the user's sound signals and convert the collected sound signals into electrical signals, which are received by the audio circuit and converted into audio data (i.e., forming the user's input speech); the audio data is then transmitted to the processor for speech processing. On the other hand, after synthesizing the reply speech based on the user's input speech, the processor 2011 transmits it to the audio circuit; the audio circuit converts the received audio data (i.e., the reply speech) into an electrical signal and transmits it to the loudspeaker, which converts it into a sound signal for output.
In a third aspect, an embodiment of the present invention provides a speech synthesis device, where the speech synthesis device includes a speech recognition module, a speech dialogue module, and a speech synthesis module, where:
the speech recognition module is configured to receive a user's current input speech;
the speech dialogue module is configured to: determine the user's identity according to the user's current input speech; determine basic speech synthesis information according to the user's identity, where the basic speech synthesis information includes variations of one or more of a preset speech rate, a preset volume, and a preset pitch of an acoustic model; determine a reply text according to the current input speech; and determine enhanced speech synthesis information according to the reply text and context information, where the enhanced speech synthesis information includes variations of one or more of a preset timbre, a preset intonation, and a preset prosodic rhythm of the acoustic model;
the speech synthesis module is configured to: obtain the acoustic model from a preset acoustic model library according to the current input speech, where the preset information of the acoustic model includes the preset speech rate, the preset volume, the preset pitch, the preset timbre, the preset intonation, and the preset prosodic rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
The speech recognition module, the speech dialogue module, and the speech synthesis module are specifically configured to implement the speech synthesis method described in the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
According to a fifth aspect, an embodiment of the present invention provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, and thus automatically combine the user's preferences and the dialogue scenario to generate reply speech of different styles, providing a personalized speech synthesis effect for different users. This greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of human-machine dialogue. In addition, the terminal also allows the user to tune the terminal's voice response system in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal is closer to the user's interaction preferences, maximizing the user's interaction experience.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the background art more clearly, the accompanying drawings required in the embodiments of the present invention or the background art are described below.
FIG. 1 is a schematic diagram of the basic physical elements of speech according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another system architecture according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system architecture and a terminal device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a TTS parameter library according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an acoustic model library according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of performing speech synthesis on a reply text according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of still another system architecture and a terminal device according to an embodiment of the present invention;
FIG. 10 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention;
FIG. 11 is an exemplary chart of basic TTS parameters associated with user identities according to an embodiment of the present invention;
FIG. 12 is an exemplary chart of a customized character pronunciation table according to an embodiment of the present invention;
FIG. 13 is an exemplary chart of an emotion parameter correction mapping table according to an embodiment of the present invention;
FIG. 14 is an exemplary chart of speech emotion parameters associated with user identities according to an embodiment of the present invention;
FIG. 15 is an exemplary chart of a scene parameter correction mapping table according to an embodiment of the present invention;
FIG. 16 is an exemplary chart of speech scene parameters associated with user identities according to an embodiment of the present invention;
FIG. 17 to FIG. 19 are exemplary charts of calling instructions corresponding to reply texts according to an embodiment of the present invention;
FIG. 20 is a schematic flowchart of a method for updating a customized character pronunciation table according to an embodiment of the present invention;
FIG. 21 is a schematic flowchart of a method for determining the TTS parameters required for a current reply text according to an embodiment of the present invention;
FIG. 22 is a schematic flowchart of a speech synthesis method related to a "poetry recitation" speech scene according to an embodiment of the present invention;
FIG. 23 is a schematic diagram of aligning the content of a reply text with a prosodic rhythm template according to an embodiment of the present invention;
FIG. 24 is a schematic flowchart of a speech synthesis method related to a "song humming" speech scene according to an embodiment of the present invention;
FIG. 25 is a schematic diagram of performing beat alignment on the content of a reply text according to an embodiment of the present invention;
FIG. 26 is a schematic flowchart of a speech synthesis method related to a "character imitation" scene according to an embodiment of the present invention;
FIG. 27 is an exemplary chart of sound features corresponding to the voice types of some specific acoustic models according to an embodiment of the present invention;
FIG. 28 is a schematic diagram of a selection interface for speech feature parameters and language style feature parameters according to an embodiment of the present invention;
FIG. 29 is a schematic flowchart of a speech synthesis method for a scene with superimposed background sound effects according to an embodiment of the present invention;
FIG. 30 is a schematic diagram of determining a best-matching music segment according to an embodiment of the present invention;
FIG. 31 is a schematic structural diagram of a hardware device according to an embodiment of the present invention.
DETAILED DESCRIPTION
Nowadays, with the rapid development of human-machine dialogue technology, people have higher requirements for the timeliness and personalization of human-machine dialogue. Users are no longer satisfied with a machine that merely "speaks like a human"; instead, they expect the machine to provide personalized voice interaction for different users. For example, when the user is an elderly lady with poor hearing, she may want the machine to automatically raise the speech volume; a user may want to tune the machine in the way one educates a person, so that the machine's spoken replies match the user's personality, mood, and hobbies; a user may want the machine's replies to sound more vivid and interesting, with a tone that fits the emotion of the context; or a user may want the machine's replies to fit the dialogue scene, for example the machine automatically recites poetry, sings, or tells a story according to the dialogue scene. Based on this, the embodiments of the present invention provide a speech synthesis method and corresponding devices, which are used to meet people's personalized and diversified requirements for speech synthesis in human-computer interaction.
The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. The terms used in the implementation part of the present invention are only used to explain specific embodiments of the present invention and are not intended to limit the present invention.
To facilitate understanding of the technical solutions of the embodiments of the present invention, the related concepts involved in the embodiments of the present invention are explained first.
Speech (speech sound), that is, the sound of language, is the sound-wave form of language as a communication tool; speech realizes the expressive function and the social function of language. The basic physical elements of speech mainly include sound intensity, sound duration, pitch, and timbre. Referring to FIG. 1, they are described as follows:
(1) Sound intensity (intensity). In different scenarios, sound intensity may also be called volume, stress, emphasis, accent, and so on. The sound intensity is determined by the amplitude of the sound wave and is proportional to it, indicating the strength of the sound. In Chinese, sound intensity plays a role in distinguishing word meanings and has certain grammatical functions; for example, sound intensity determines the distinguishing meaning of the neutral tone and of stress.
(2) Sound duration (duration). The sound duration indicates how long the sound-wave vibration lasts; it is determined by the duration of the vibration of the sound-producing body, and the longer the vibration time, the longer the sound wave. The sound duration can be characterized by the concept of speech rate (speed), which indicates how fast a sound is produced; that is, the longer the sound duration, the slower the speech rate.
(3) Pitch, sometimes also called tone height. The pitch is determined by the vibration frequency of the sound wave: the higher the vibration frequency, the higher the pitch. In Chinese, the tones of Chinese characters and the intonation of sentences are mainly determined by pitch.
(4) Timbre. In different scenarios, timbre may also be called sound quality, voice quality, and so on. The timbre represents the character and essence of a sound; different timbres correspond to different zigzag forms of the sound-wave ripple (sound waveforms). Timbre is the basic feature that distinguishes one sound from other sounds, and the timbres of different people (or sound-producing bodies) differ from one another.
Chinese differs from Western languages in its grammatical structure, grammatical rules, acoustic characteristics, and prosodic structure. In Chinese, a character corresponds to one sound, that is, one syllable is generally one Chinese character, and the tone is an indispensable part of the syllable structure. A tone is usually used to express the rise and fall of a syllable when it is pronounced, so tones are also called character tones. Besides being determined mainly by changes in pitch, the formation of tones is also reflected in changes of duration. During pronunciation, the sound-producing body can adjust the pitch and duration at any time, thereby forming different tones. Tones carry an important meaning-distinguishing function; for example, tones are used to distinguish the meanings of the Chinese words "题材" (subject matter) and "体裁" (genre), or "练习" (exercise) and "联系" (contact), and so on. In addition, in Chinese, each character has a corresponding fundamental frequency (the frequency of the fundamental tone, which determines the basic pitch of the character), and the fundamental frequencies of adjacent characters may influence one another, producing variations of the fundamental frequency (that is, tone sandhi). Moreover, in Chinese, pauses occur in the pronunciation of continuous sentences, and different characters in a sentence are pronounced lightly or stressed according to the surrounding semantics. Together, these grammatical structures, grammatical rules, acoustic characteristics, and prosodic structures form the cadence, the emotional tone, and the prosodic rhythm of spoken Chinese.
The following describes the system architecture involved in the embodiments of the present invention. The system architecture of the embodiments of the present invention involves a user and a terminal. The user inputs speech to the terminal, and the terminal can process the user's speech through a voice response system to obtain speech for replying to the user and present the reply speech to the user. The terminal in the embodiments of the present invention may be a dialogue-interaction robot, a home/commercial robot, a smart speaker, a smart desk lamp, a smart home appliance, smart furniture, or a smart vehicle, and may also be a voice assistant / voice dialogue software application running on a mobile device such as a smartphone, a laptop computer, or a tablet computer.
For example, in one application scenario, referring to FIG. 2, the terminal is a robot. The user speaks to the robot (for example, the user talks directly to the robot), and the robot replies to the user with speech (for example, the robot plays the reply speech through a buzzer), thereby realizing a human-machine dialogue between the user and the robot.
For another example, in another application scenario, referring to FIG. 3, the terminal is a voice assistant running on a smartphone. The user speaks to the voice assistant (for example, the user triggers the voice-assistant icon displayed on the smartphone and then talks), and the voice assistant replies to the user with speech (for example, the speech information is displayed on the screen and the reply speech is played through a buzzer), thereby realizing an interactive dialogue between the user and the voice assistant.
In addition, it should be noted that the terminal may also be a server. For example, in yet another application scenario, the user speaks to a smartphone, the smartphone transmits the speech information to the server, the server obtains a reply speech according to the speech information and returns the reply speech to the smartphone, and the smartphone then presents the reply speech to the user (for example, by displaying the speech information on the screen and playing the reply speech through a buzzer), thereby realizing an interactive dialogue between the user and the server.
The voice response system of the terminal in the foregoing system architecture is described in detail below.
Referring to FIG. 4, FIG. 4 shows a voice response system 10 of a terminal in a system architecture. As shown in FIG. 4, the voice response system 10 includes a speech recognition module 101, a speech dialogue module 102, and a speech synthesis module 103. The functions of the modules are described as follows:
(1) Automated speech recognition (ASR) module 101. The ASR module 101 is used to recognize the content of the speech input by the user and recognize the speech content as text, realizing the conversion from "speech" to "text".
(2) Speech dialogue module 102. The speech dialogue module 102 can be used to generate a reply text based on the recognized text input by the ASR module 101 and transmit the reply text to the speech synthesis module 103. The speech dialogue module 102 is also used to determine the personalized TTS parameters corresponding to the reply text, so that the subsequent speech synthesis module 103 can perform speech synthesis on the reply text based on the relevant TTS parameters. In a specific embodiment, the speech dialogue module 102 may specifically include the following modules:
Natural language understanding (NLU) module 1021. The NLU module 1021 can be used to perform syntactic analysis and semantic analysis on the recognized text input by the ASR module 101, so as to understand the content of the user's speech.
Natural language generation (NLG) module 1022. The NLG module 1022 can be used to generate a corresponding reply text according to the content of the user's speech and the context information.
Dialogue management (DM) module 1023. The DM module 1023 is responsible for tracking the current dialogue state and controlling the dialogue strategy.
User management (UM) module 1024. The UM module 1024 is responsible for user identity confirmation, user information management, and the like. In a specific embodiment, the UM module 1024 may use an existing identity recognition system (such as voiceprint recognition, face recognition, or even multi-modal biometric recognition) to determine the user's identity.
Intent recognition module 1025. The intent recognition module 1025 can be used to recognize the user intent indicated by the content of the user's speech. In a specific embodiment, corpus knowledge related to TTS parameter setting may be added to the intent recognition module 1025, so that the intent recognition module 1025 can recognize the user's interaction intent to set (update) one or more TTS parameters.
TTS parameter library 1026. As shown in FIG. 5, the TTS parameter library 1026 is used to store basic TTS parameters (also called basic speech synthesis information), enhanced TTS parameters (also called enhanced speech synthesis information), a customized character pronunciation table, a music library, and other information, which are described as follows:
The basic TTS parameters represent the variation amount of one or more of the preset speech rate, the preset volume, and the preset pitch of the acoustic model used in synthesizing speech. The basic TTS parameters are associated with the user's identity; that is, different basic TTS parameters can be organized according to the user's identity (in other words, according to the user's preferences).
The enhanced TTS parameters represent the variation amount of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm of the acoustic model used in synthesizing speech. In practical applications, the enhanced TTS parameters can be further classified into speech emotion parameters, speech scene parameters, and the like. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics. According to the emotional characteristics, the speech emotion parameters can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness; for the specific implementation, refer to the detailed description below. The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics. According to the scene characteristics, the speech scene parameters can be further divided into parameters such as daily conversation, poetry recitation, song humming, storytelling, and news broadcasting; that is, using these speech scene parameters in speech synthesis enables the synthesized speech to present the sound effects of speech scenes such as daily conversation, poetry recitation, song humming, storytelling, and news broadcasting. For the specific implementation, refer to the detailed description below.
The customized character pronunciation table includes mapping relationships between target characters and the user's preferred pronunciations. A target character may be a character (a Chinese character or other script), a letter, a digit, a symbol, and so on. The mapping relationship between a target character and the user's preferred pronunciation is used to make the target character involved in the speech synthesized by the acoustic model carry the pronunciation preferred by the user. The mapping relationships between target characters and user-preferred pronunciations are associated with the user's identity; that is, different mapping relationships can be organized according to the user's identity. For the specific implementation, refer to the detailed description below.
The music library includes a plurality of pieces of music information, which are used to provide background sound effects during speech synthesis. A background sound effect may be a specific piece of music or a sound special effect. The background sound effect is used to superimpose music or sound effects of different styles and rhythms on the background of the speech synthesized by the acoustic model, thereby enhancing the expressive effect of the synthesized speech (for example, enhancing the emotional effect). For the specific implementation, refer to the detailed description below.
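To make the organization of such a TTS parameter library more concrete, the following Python sketch shows one possible way to hold per-user basic TTS parameters, enhanced TTS parameters, a customized character pronunciation table, and a music library in memory. The class and field names are illustrative assumptions, not names defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BasicTTSParams:
    # Variation relative to the acoustic model's presets, e.g. -0.4 means "40% lower".
    speech_rate_delta: float = 0.0
    volume_delta: float = 0.0
    pitch_delta: float = 0.0

@dataclass
class EnhancedTTSParams:
    emotion: str = "neutral"     # e.g. "neutral", "happy_low", "sad_low"
    scene: str = "daily_chat"    # e.g. "daily_chat", "poetry", "humming", "story", "news"

@dataclass
class UserTTSProfile:
    basic: BasicTTSParams = field(default_factory=BasicTTSParams)
    enhanced: EnhancedTTSParams = field(default_factory=EnhancedTTSParams)
    # Customized character pronunciation table: target string -> preferred pronunciation.
    pronunciation_table: Dict[str, str] = field(default_factory=dict)

@dataclass
class TTSParameterLibrary:
    profiles: Dict[str, UserTTSProfile] = field(default_factory=dict)  # keyed by user identity
    music_library: List[str] = field(default_factory=list)             # names of background tracks

    def profile_for(self, user_id: str) -> UserTTSProfile:
        # Unregistered users fall back to a default profile.
        return self.profiles.get(user_id, UserTTSProfile())
```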
TTS parameter management (PM) module 1027. The PM module 1027 is used to manage the TTS parameters in the TTS parameter library. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more TTS parameters according to the user's intent to set the TTS parameters. For example, in a specific embodiment, the PM module 1027 may be used to determine the basic TTS parameters associated with the user according to the user's identity, and to determine the enhanced TTS parameters used to enhance the speech synthesis effect according to the content and context information of the reply text.
(3) Text-to-speech (TTS) module 103. The TTS module 103 is used to convert the reply text generated by the speech dialogue module 102 into a reply speech, so that the reply speech can be presented to the user. The TTS module 103 may specifically include the following modules:
Instruction generation module 1031. The instruction generation module 1031 can be used to generate or update a calling instruction according to the reply text and the TTS parameters (including the basic TTS parameters and the enhanced TTS parameters) transmitted from the speech dialogue module 102; the calling instruction can be applied to the TTS engine 1032.
TTS engine 1032. The TTS engine 1032 is used to call a suitable acoustic model from the acoustic model library 1033 according to the calling instruction generated or updated by the instruction generation module 1031, and to use that acoustic model to perform speech synthesis on the reply text according to information such as the basic TTS parameters, the enhanced TTS parameters, the mapping relationships between target characters and user-preferred pronunciations, and the background sound effect, thereby generating a reply speech and returning the reply speech to the user.
Acoustic model library 1033. As shown in FIG. 6, the acoustic model library 1033 may include multiple acoustic models, for example a general acoustic model and several personalized acoustic models. These acoustic models are all neural network models, and they can be trained in advance on different corpora. Each acoustic model corresponds to its own preset information; that is, each acoustic model is bound to specific preset information, which can serve as the basic input information of that acoustic model. For example, the preset information of the general acoustic model may include two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model may include, in addition to two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, other personalized information, for example language style features such as catchphrases, ways of responding to specific scenes, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for specific people. It should be understood that the preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information of different acoustic models also differ; for example, the preset information of a personalized acoustic model may be clearly different from that of the general acoustic model. In the embodiments of the present invention, an acoustic model can convert the reply text into a reply speech according to its preset information and the variation information of the preset information. The variation information of the preset information here refers to the information selected for speech synthesis, such as the basic TTS parameters, the enhanced TTS parameters, the mapping relationships between target characters and user-preferred pronunciations, and the background sound effect. The speech synthesized by the general acoustic model presents the sound effect of a normal, general dialogue scene, while the speech synthesized by a personalized acoustic model can present the sound effect of a "character imitation" dialogue scene. The implementation of the "character imitation" dialogue scene is described in detail later.
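To illustrate how an acoustic model library of this kind could be organized, the sketch below binds each model to its own preset information and distinguishes the general model from personalized models. The model names, preset values, style features, and the selection rule are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AcousticModel:
    name: str
    personalized: bool
    # Preset information bound to this model (illustrative values only).
    presets: Dict[str, float] = field(default_factory=dict)
    # Extra language-style features only personalized models carry (catchphrases, dialect, ...).
    style_features: Dict[str, str] = field(default_factory=dict)

ACOUSTIC_MODEL_LIBRARY = {
    "general": AcousticModel(
        name="general", personalized=False,
        presets={"rate": 1.0, "volume": 1.0, "pitch": 1.0}),
    "imitated_character_a": AcousticModel(
        name="imitated_character_a", personalized=True,
        presets={"rate": 0.9, "volume": 1.1, "pitch": 0.8},
        style_features={"catchphrase": "...", "dialect": "..."}),
}

def select_acoustic_model(requested_character: str = "") -> AcousticModel:
    """Use a personalized model when a character imitation is requested, else the general model."""
    return ACOUSTIC_MODEL_LIBRARY.get(requested_character, ACOUSTIC_MODEL_LIBRARY["general"])
```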
It should be noted that, in possible embodiments, the modules in the embodiment of FIG. 4 may be software modules. These software modules may be stored in the memory of the terminal device, and the processor of the terminal device calls these modules in the memory to execute the speech synthesis method. In addition, in possible embodiments, the modules in the embodiment of FIG. 4 may also be implemented as hardware components in the terminal device.
The following briefly describes the speech synthesis process based on the voice response system described in FIG. 4. Referring to FIG. 7, after the voice response system obtains the user's input speech, the reply text is obtained through the speech recognition module and the speech dialogue module. The speech dialogue module determines, from the TTS parameter library, the basic TTS parameters associated with the identity of the current user; it determines the enhanced TTS parameters and the background sound effect from the TTS parameter library based on the reply text and the context information; and if the reply text contains a target character associated with the user's identity, it also determines the user-preferred pronunciation corresponding to the target character. After that, the speech synthesis module calls a suitable acoustic model from the acoustic model library based on the user's input speech, or the user's preferences (the user's preferences are associated with the user's identity), or the reply text, and performs speech synthesis through the acoustic model in combination with the TTS parameters (one or more of the basic TTS parameters, the enhanced TTS parameters, the mapping relationship between the target character and the user-preferred pronunciation, and the background sound effect), thereby generating the reply speech to be presented to the user.
To facilitate understanding of the solutions of the embodiments of the present invention, FIG. 8 is taken as an example below. FIG. 8 shows the speech synthesis process in an application scenario. As shown in FIG. 8, in this application scenario, after the voice response system obtains the user's input speech, the reply text obtained through the speech recognition module and the speech dialogue module is "今天天气很好" (the weather is very good today). The speech dialogue module determines the basic TTS parameters associated with the user's identity, determines enhanced TTS parameters such as the speech emotion parameter and the speech scene parameter based on the content and context information of the reply text, and determines the background sound effect based on the content of the reply text. The speech synthesis module can then, through the selected acoustic model, perform speech synthesis on the reply text based on the selected basic TTS parameters, speech emotion parameter, speech scene parameter, and background sound effect, and finally generate the synthesized speech used to reply to the user (jin1 tian1 tian1 qi4 hen3 hao3).
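The flow just described can be summarized in a short end-to-end sketch. In the following Python fragment, the injected components and their methods (`recognize`, `identify_user`, `generate_reply`, `pick_enhanced_params`, `pick_background`, `select`, `synthesize`) are placeholders standing in for the ASR, dialogue, and TTS modules of FIG. 4; their names and signatures are assumptions made for illustration only.

```python
def handle_utterance(audio, asr, dialogue, tts, param_lib, model_lib):
    """One turn of the voice response system: input speech in, reply speech out."""
    text = asr.recognize(audio)                    # ASR: speech -> recognized text
    user_id = dialogue.identify_user(audio)        # UM: determine who is speaking
    reply_text = dialogue.generate_reply(text)     # NLU + DM + NLG: build the reply text

    profile = param_lib.profile_for(user_id)       # basic params tied to the user's identity
    enhanced = dialogue.pick_enhanced_params(reply_text, dialogue.context)
    background = dialogue.pick_background(reply_text, param_lib.music_library)

    model = model_lib.select(audio, profile, reply_text)   # general or personalized model
    return tts.synthesize(reply_text,
                          basic=profile.basic,
                          enhanced=enhanced,
                          pronunciations=profile.pronunciation_table,
                          background=background,
                          acoustic_model=model)
```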
It should be noted that the embodiment of FIG. 4 is only one specific implementation of the present invention. Other possible implementations of the present invention may include more or fewer functional modules, and the functional modules described above may also be appropriately split, combined, or redeployed.
For example, the acoustic model library 1033 may be deployed in the TTS engine 1032, which makes it more convenient for the TTS engine to call an acoustic model and perform speech synthesis through the acoustic model.
For example, the acoustic model library 1033 may also be deployed in the speech dialogue module 102, or at a location outside the speech dialogue module 102.
For example, in a possible implementation, the PM module 1027 and the TTS parameter library 1026 may also be integrated together and deployed independently at a location outside the speech dialogue module 102.
For example, in a possible implementation, the PM module 1027 may also be deployed in the TTS engine 1032; that is, "TTS parameter management" may be implemented as a function of the TTS engine 1032. For another example, in a possible embodiment, the intent recognition module 1025 may also be deployed in the DM module 1023; that is, "intent recognition" may be implemented as a function of the DM module 1023.
For example, in possible embodiments, the TTS parameter library 1026 may be deployed in the PM module 1027, that is, the PM module 1027 may organize and store the TTS parameters by category and user identity; or the TTS parameter library 1026 may be deployed independently at a location outside the speech dialogue module 102; or the acoustic model library 1033 may be deployed independently at a location outside the TTS module 103; or the acoustic model library 1033 may also be deployed together with the TTS parameter library 1026; and so on.
For another example, in a possible implementation, as shown in FIG. 9, in order to enrich the selectability of TTS parameters in speech synthesis, the PM module 1027 may be split into a basic TTS parameter management module 1028 and an enhanced TTS parameter management module 1029. The basic TTS parameter management module 1028 is used to manage the basic TTS parameters and the customized character pronunciation table in the TTS parameter library 1026. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more basic TTS parameters according to the user's intent to set the basic TTS parameters, and performing the same operations on the customized character pronunciation table according to the user's intent to set the user-preferred pronunciation corresponding to a target character. During speech synthesis, the basic TTS parameter management module 1028 can also be used to obtain the basic TTS parameters associated with the user's identity, the user-preferred pronunciation corresponding to a target character, and so on. The enhanced TTS parameter management module 1029 is used to manage the enhanced TTS parameters and the music library in the TTS parameter library 1026. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more enhanced TTS parameters according to the user's intent to set the enhanced TTS parameters, and performing the same operations on the music library according to the user's intent to set the background sound effect. During speech synthesis, the enhanced TTS parameter management module 1029 can obtain, according to the content and context information of the reply text, the enhanced TTS parameters and the background sound effect used to enhance the speech synthesis effect.
It should be noted that, in possible embodiments, the modules in the embodiment of FIG. 9 may be software modules. These software modules may be stored in the memory of the terminal device, and the processor of the terminal device calls these modules in the memory to execute the speech synthesis method. In other possible embodiments, the modules in the embodiment of FIG. 9 may also be implemented as hardware components in the terminal device.
For another example, in a possible implementation, the enhanced TTS parameter management module 1029 may also be deployed in the TTS engine 1032; that is, "enhanced TTS parameter management" may be implemented as a function of the TTS engine 1032.
It should also be noted that, to facilitate understanding of the technical solutions of the present invention, this document mainly describes the technical solutions of the present invention based on the functional modules presented in the embodiment of FIG. 4; implementations based on other forms of functional modules can be derived by similar reference and are not described one by one herein.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, in the human-machine voice interaction between the user and the terminal, after the ASR module recognizes the user's speech as text, the speech dialogue module on the one hand generates the corresponding reply text, and on the other hand can select personalized TTS parameters based on the reply text of the dialogue interaction and the dialogue context information, in combination with the current user's identity, preferences, and the dialogue scenario. The TTS module can then generate a reply speech of a specific style according to these personalized TTS parameters, providing the user with a personalized speech synthesis effect, which greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of human-machine dialogue. In addition, the terminal also allows the user to tune the terminal in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal is closer to the user's interaction preferences, maximizing the user's interaction experience.
Referring to FIG. 10, based on the foregoing system architecture and voice response system, the following describes, from a multi-sided perspective, the flow of the speech synthesis method provided by an embodiment of the present invention. The method flow includes but is not limited to the following steps:
Step 101. The user inputs speech to the terminal; correspondingly, the terminal obtains the speech input by the user.
The terminal in the embodiments of the present invention may be a dialogue-interaction robot, a home/commercial robot, a smart speaker, a smart desk lamp, a smart home appliance, smart furniture, or a smart vehicle, and may also be a voice assistant / voice dialogue software application running on a mobile device such as a smartphone, a laptop computer, or a tablet computer. For specific implementation, refer to the description of the embodiment in FIG. 2 or FIG. 3; details are not repeated here.
Step 102. The terminal recognizes the content of the speech input by the user and recognizes the speech as text.
In a specific embodiment, the terminal can recognize the content of the user's input speech through the ASR module of its voice response system, for example recognizing that the content of the user's input speech is: "You speak too slowly, please speak a bit faster", "Can you speak a bit louder?", "What is the line before '白云深处有人家'?", and so on. The ASR module can be implemented directly using a current commercial ASR system; those skilled in the art are familiar with its implementation, which is not described here.
Step 103. The terminal determines the identity of the user.
In a specific embodiment, the terminal may recognize the user's identity through the UM module of its voice response system. For example, the UM module may determine the identity of the speaker (that is, the user) by means of voiceprint recognition, face recognition, or even multi-modal biometric recognition. It can be understood that, if the terminal recognizes the user as a locally registered user (for example, the current user is xiaoming), the TTS parameters corresponding to that user can subsequently be retrieved; if the terminal cannot recognize the user's identity, it determines that the user is an unknown user (for example, the current user is xiaohua), and the default TTS parameters can subsequently be retrieved.
Step 104. The terminal determines the user's speaking intent.
In a specific embodiment, the terminal may determine the user's speaking intent by combining the NLU module and the intent recognition module of its voice response system. The implementation process is as follows. The NLU module performs text analysis on the recognized text, including word segmentation, semantic analysis, and part-of-speech analysis, and identifies the keywords/words in it. For example, keywords/words related to TTS parameter setting may include "voice", "volume", "speaking speed", "pronunciation", "emotion", "recite", "fast", "slow", "happy", "sad", and so on. The intent recognition module, in combination with the dialogue context, performs coreference resolution and sentence-meaning completion on the recognized text, and can then use a template matching method or a statistical model method to recognize whether the user has the intent to update the TTS parameters. Coreference resolution refers to determining which noun phrase a pronoun in the recognized text refers to.
For the template matching method, the keywords and word combinations appearing in common instructions can first be analyzed, and templates/rules can then be constructed to match specific intents. For example, if a sentence pattern such as "... voice/speak/talk/read ... slow/fast ..." appears in the text sentence, it can be considered that the user's speaking intent is to adjust the speech rate in the basic TTS parameters corresponding to that user (for example, increase or decrease the speech rate by 20%). If a sentence pattern such as "... voice/speak/talk/read ... loud/quiet/big/small ..." appears, it can be considered that the user's speaking intent is to adjust the volume in the basic TTS parameters corresponding to that user (for example, increase or decrease the volume by 20%). If a sentence pattern such as "the [word 1] in what was just said ... should be pronounced/read ... [word 2]" appears, it can be considered that the user's speaking intent is to correct/add a pronunciation in the customized character pronunciation table in the basic TTS parameters corresponding to that user. If a sentence pattern such as "... emotion/feeling/read/talk/speak ... happy/joyful/glad/pleasant ..." appears, it can be considered that the user's speaking intent is to set the speech emotion parameter to "mild happiness". If one or more lines of poetry appear in the text sentence, or a sentence pattern such as "... recite/read/declaim ... poem/poetry/verse ..." appears, it can be considered that the user's speaking intent is to set the speech scene parameter to "poetry recitation", and so on.
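A minimal sketch of such template matching is shown below, using regular expressions over the recognized text. The particular rule set, intent labels, and payloads are a simplified illustration; a real system would carry many more templates.

```python
import re

# Each rule: (pattern over the recognized text, intent label, parameter payload)
TEMPLATE_RULES = [
    (re.compile(r"(声音|说|讲|读).*(慢|快)"), "adjust_speech_rate", {"delta": 0.2}),
    (re.compile(r"(声音|说|讲|读).*(大声|小声|大|小)"), "adjust_volume", {"delta": 0.2}),
    (re.compile(r"(感情|情感|读|讲|说).*(高兴|欢乐|开心|愉快)"), "set_emotion", {"emotion": "happy_low"}),
    (re.compile(r"(念|读|朗诵).*(诗|诗歌|词)"), "set_scene", {"scene": "poetry"}),
]

def match_intent(recognized_text: str):
    """Return (intent, payload) for the first template the text matches, else (None, None)."""
    for pattern, intent, payload in TEMPLATE_RULES:
        if pattern.search(recognized_text):
            return intent, payload
    return None, None

# Example: "说话太慢了，请说快一点吧" matches the speech-rate template.
print(match_intent("说话太慢了，请说快一点吧"))   # -> ('adjust_speech_rate', {'delta': 0.2})
```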
For the statistical model method, common expressions corresponding to various user speaking intents can be collected in advance, and each type of expression intent is labelled with a category, forming a training set containing multiple pieces of labelled data; the labelled data of the training set are then used to train a machine learning model. Training algorithms include but are not limited to the support vector machine (SVM) algorithm, the Naive Bayes algorithm, the decision tree algorithm, and neural network (NN) algorithms. In this way, after the model is trained, when the user's speaking intent needs to be determined, the keywords/words of the text sentence corresponding to the user's speech are input into the model, and the speaking intent corresponding to that text sentence can be determined. Further, the trained models may also be classified in advance by dialogue domain or topic type, for example into "weather", "poetry", "song", "news", "daily communication", "movie", "sports", and other models. In this way, the intent recognition module can determine the dialogue domain or topic type according to the current dialogue state and the keywords/words of the text sentence, and then preferentially feed the keywords/words into the corresponding dialogue-domain model or topic-type model, thereby determining the speaking intent corresponding to the text sentence.
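As an illustration of the statistical approach, the sketch below trains a tiny Naive Bayes intent classifier with scikit-learn on a handful of labelled utterances. The training sentences, the intent labels, and the use of character n-grams (to avoid a separate word segmenter) are all assumptions made for the example; a practical system would need a much larger labelled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A toy labelled training set: (utterance, intent category).
train_texts = [
    "说话太慢了，请说快一点吧", "你讲得太快了，慢一点",
    "声音能不能大一点", "说话小声一点",
    "请用高兴的语气说话", "读得悲伤一点",
    "给我朗诵一首诗", "念一下这首诗歌",
]
train_labels = [
    "adjust_speech_rate", "adjust_speech_rate",
    "adjust_volume", "adjust_volume",
    "set_emotion", "set_emotion",
    "set_scene_poetry", "set_scene_poetry",
]

# Character n-grams work directly on Chinese text without explicit word segmentation.
intent_model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    MultinomialNB(),
)
intent_model.fit(train_texts, train_labels)

print(intent_model.predict(["说慢一点好吗"])[0])   # expected: adjust_speech_rate
```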
Step 105. The terminal determines whether the user's speaking intent is to set TTS parameters.
Step 106. If it is determined that the speaking intent is to set TTS parameters (operations such as update, deletion, and addition), the terminal performs the TTS parameter setting operation according to the indication of the speaking intent. The TTS parameters include basic TTS parameters such as the variation amounts of the speech rate, volume, and pitch associated with the user's identity, as well as the customized character pronunciation table; the TTS parameters also include enhanced TTS parameters such as speech emotion parameters and speech scene parameters, background sound effects, and so on. It should be understood that, in possible implementations, the enhanced TTS parameters may or may not be associated with the user's identity. The setting operations are correspondingly operations such as adding TTS parameters, deleting TTS parameters, and updating (changing) TTS parameters.
In a specific embodiment, if the user is a locally registered user, an update operation may be performed on the TTS parameters associated with that user's identity. If the user is an unregistered user, a local user identity may first be created/registered for the user; the local user identity is initially associated with the default TTS parameters, and an update operation is then performed on the default TTS parameters associated with that user identity.
In a specific embodiment, the terminal may, through the PM module of the voice response system, perform the update operation on the TTS parameters associated with the user's identity in the TTS parameter library according to a TTS parameter update instruction issued by the speech dialogue module (specifically, for example, the NLU module and/or the intent recognition module).
For example, in the embodiments of the present invention, the basic TTS parameters represent variation amounts (or variation coefficients) relative to the basic physical elements of speech. The variation amounts of the preset speech rate, preset volume, and preset pitch in the basic TTS parameters can be organized and stored by user identity. Referring to FIG. 11, FIG. 11 shows an exemplary chart of basic TTS parameters associated with user identities. As shown in FIG. 11, the arrays in the chart represent the rise/fall ratios relative to the default values of the preset speech rate, preset volume, and preset pitch of the acoustic model selected for speech synthesis. The chart includes unregistered users and registered users. An unregistered user is a user who has not yet registered an identity or whose authentication has failed; the associated variation amounts of the preset speech rate, preset volume, and preset pitch are all the default value 0. Registered users are users who have registered an identity and passed authentication, for example "xiaoming", "xiaoming_mom", "xiaoming_grandma", and "xiaoming_dad". It can be seen that, for the user "xiaoming_grandma", the associated basic TTS parameters for speech rate, volume, and pitch are "-40%, +40%, +20%"; that is, when synthesizing speech for this user, the basic speech corresponding to the reply text will have its speech rate decreased by 40%, its volume increased by 40%, and its pitch increased by 20%. It can also be seen that the variation amounts of the preset speech rate, preset volume, and preset pitch corresponding to these registered users can be added, corrected/changed, or deleted. For example, based on the speaking intent of "xiaoming" to "increase the volume", the terminal raises the variation amount of the preset volume associated with "xiaoming" from the default value "0" to "+20%"; for another example, based on the speaking intent of "xiaoming_mom" to "lower the speech rate", the terminal reduces the variation amount of the preset speech rate associated with "xiaoming_mom" from the original "+40%" to "+20%", and so on.
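The chart of FIG. 11 can be mirrored directly as a small lookup structure. The sketch below holds rise/fall ratios per user identity and applies them to the presets of an acoustic model; the entries not stated in the text above and the concrete preset numbers used for the model are assumptions for illustration only.

```python
# Variation amounts (rise/fall ratios) per user identity, in the spirit of FIG. 11.
BASIC_TTS_PARAMS = {
    "unregistered":     {"rate": 0.0,  "volume": 0.0,  "pitch": 0.0},
    "xiaoming":         {"rate": 0.0,  "volume": +0.2, "pitch": 0.0},   # volume raised to +20%
    "xiaoming_mom":     {"rate": +0.2, "volume": 0.0,  "pitch": 0.0},   # rate lowered to +20%
    "xiaoming_grandma": {"rate": -0.4, "volume": +0.4, "pitch": +0.2},
}

# Assumed presets of the selected acoustic model (illustrative numbers only).
MODEL_PRESETS = {"rate": 1.0, "volume": 1.0, "pitch": 1.0}

def apply_basic_params(user_id: str) -> dict:
    """Scale the acoustic model's presets by the user's variation amounts."""
    deltas = BASIC_TTS_PARAMS.get(user_id, BASIC_TTS_PARAMS["unregistered"])
    return {key: MODEL_PRESETS[key] * (1.0 + deltas[key]) for key in MODEL_PRESETS}

# For xiaoming_grandma: 40% slower, 40% louder, 20% higher pitch.
print(apply_basic_params("xiaoming_grandma"))   # {'rate': 0.6, 'volume': 1.4, 'pitch': 1.2}
```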
For another example, in the embodiments of the present invention, the customized character pronunciation table can be organized and stored by user identity. Referring to FIG. 12, FIG. 12 shows an exemplary chart of a customized character pronunciation table associated with user identities. As shown in FIG. 12, the customized character pronunciation table corresponding to an unregistered user is empty, while entries in the customized character pronunciation table corresponding to a registered user can be added, changed, or deleted based on that user's preferences. The objects of the setting operation may be characters, person/place names, letters, special symbols, and so on that the terminal easily mispronounces or that the user prefers to pronounce in a particular way. The customized character pronunciation table includes mapping relationships between target characters (or strings) and user-preferred pronunciations. A target character (string) may be a character (Chinese or foreign), a word, a phrase, or a sentence, and may also be a digit or a symbol (such as a Chinese character, a foreign character, an emoticon, a punctuation mark, a special symbol, and so on). For example, the terminal's original preset pronunciation table pronounces "小猪佩奇" as "xiao3 zhu1 pei4 qi2"; if the speaking intent of "xiaoming" is to set the pronunciation of "奇" in the phrase "小猪佩奇" to "ki1", the terminal writes "小猪佩奇" and "xiao3 zhu1 pei4 ki1" into the customized character pronunciation table associated with "xiaoming" as a mapping relationship. It can be understood that the chart shown in FIG. 12 is merely an example and not a limitation.
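The sketch below shows one way such a per-user pronunciation table could be stored and applied when preparing the pronunciation of a phrase in a reply text. The default lookup function is a stand-in assumption for the terminal's built-in grapheme-to-phoneme step, not a real component.

```python
# Per-user customized character pronunciation table: target string -> preferred pronunciation.
PRONUNCIATION_TABLE = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},
    "xiaoming_mom": {},
}

def default_pronunciation(text: str) -> str:
    """Stand-in for the terminal's preset pronunciation lookup (assumed, not real)."""
    builtin = {"小猪佩奇": "xiao3 zhu1 pei4 qi2"}
    return builtin.get(text, text)

def pronounce(user_id: str, text: str) -> str:
    """Prefer the user's customized pronunciation over the terminal's default."""
    overrides = PRONUNCIATION_TABLE.get(user_id, {})
    return overrides.get(text, default_pronunciation(text))

print(pronounce("xiaoming", "小猪佩奇"))       # xiao3 zhu1 pei4 ki1 (user preference)
print(pronounce("xiaoming_mom", "小猪佩奇"))   # xiao3 zhu1 pei4 qi2 (terminal default)
```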
For another example, in this embodiment of the present invention, regarding the speech emotion parameter among the enhanced TTS parameters: the speech emotion parameter characterizes intonation changes in the speech, that is, changes in the rise and fall of pitch, in loudness, in speaking speed, and in the positions and durations of pauses. These changes play a very important role in how expressive the speech is; through changes in intonation, the speech can convey complex emotions such as happiness, delight, sadness, sorrow, distress, hesitation, ease, firmness, and boldness.
In a specific embodiment of the present invention, the TTS parameter library maintains a mapping between "the speech emotion suggested by the voice dialogue module" and "speech emotion parameters", for example the emotion parameter correction mapping table shown in FIG. 13. Speech synthesized with different speech emotion parameters carries the corresponding emotional tone. If the speech emotion suggested by the voice dialogue module is "Neutral", the speech synthesized by the speech synthesis module from the neutral emotion parameters reflects a neutral tone (that is, without any emotional coloring); if the suggested emotion is "Happy_low", the synthesized speech carries a mildly happy tone; if the suggested emotion is "Sad_low", the synthesized speech carries a mildly sad tone; and so on. It can be understood that the table shown in FIG. 13 is merely an example and not a limitation.
In a specific embodiment of the present invention, the speech emotion parameter is related not only to the user identity but also to the reply text and the context information. After a user identity is created, the default speech emotion parameter associated with that identity may correspond to the neutral emotion. During the voice dialogue, the terminal determines the speech emotion parameter used in the current speech synthesis from the user identity, the reply text, and the context information together. For example, if the terminal determines that the reply text and the context information do not specify a speech emotion, or that the specified emotion is consistent with the user's default speech emotion, the terminal applies the user's default speech emotion to the final speech synthesis; for instance, if the user's default speech emotion is "neutral" and the terminal determines that the speech synthesis of the current reply text specifies no emotion, the terminal still applies "neutral" to the synthesis of the final speech. If the terminal determines that the reply text and the context information do require a specified speech emotion, and that emotion differs from the user's default, the terminal automatically adjusts the current speech emotion to the specified one; for instance, if the user's default speech emotion is "neutral" but the terminal determines that the speech synthesis of the current reply text requires a "mildly happy" emotion, the terminal uses the "mildly happy" speech emotion parameters for the final speech synthesis.
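The selection rule just described can be summarized in a short sketch (the function and value names are illustrative; the same shape of rule is applied to the speech scene parameter later in this description):

```python
def choose_speech_emotion(user_default_emotion, suggested_emotion):
    """Decision rule described above: if the reply text / context suggests
    no emotion, or the suggestion equals the user's default, keep the
    default; otherwise switch to the suggested emotion."""
    if suggested_emotion is None or suggested_emotion == user_default_emotion:
        return user_default_emotion
    return suggested_emotion

print(choose_speech_emotion("Neutral", None))         # -> Neutral
print(choose_speech_emotion("Neutral", "Happy_low"))  # -> Happy_low
```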
In a specific embodiment, the terminal may update the speech emotion parameter associated with a user identity based on that user's speaking intention. As shown in FIG. 14, the terminal may, according to the speaking intention of "xiaoming_grandma", change the speech emotion parameters associated with "xiaoming_grandma" from the default "neutral" parameters to the "mildly happy" parameters. It can be understood that the table shown in FIG. 14 is merely an example and not a limitation.
For another example, in this embodiment of the present invention, regarding the speech scene parameter among the enhanced TTS parameters: the speech scene parameter characterizes changes in the prosodic rhythm of the speech. Compared with the prosodic rhythm of ordinary dialogue in its natural state, the changed rhythm is clearer and more pronounced and carries stronger emotional expression, so that the spoken dialogue fits a specific application scenario. The rhythm changes can be reflected in the positions and durations of pauses, the positions of stress, the durations of words or characters, the speaking rate of words or characters, and so on. These specific rhythm changes can present speech scene effects such as "poetry recitation", "song humming (or nursery-rhyme humming)", "storytelling", and "news broadcasting".
In a specific embodiment of the present invention, the TTS parameter library maintains a mapping between "the speech scene suggested by the voice dialogue module" and "speech scene parameters", for example the scene parameter correction mapping table shown in FIG. 15. It can be understood that speech synthesized with different speech scene parameters reflects the tone of the corresponding scene: speech synthesized with the speech scene parameters of daily dialogue sounds like daily conversation, speech synthesized with the parameters of poetry recitation sounds like a poem being recited, speech synthesized with the parameters of song humming sounds like a song being hummed, and so on. It can be understood that the table shown in FIG. 15 is merely an example and not a limitation; in possible embodiments, other speech scene parameters may also be designed according to the needs of actual applications, such as storytelling or news broadcasting.
In a specific embodiment of the present invention, the speech scene parameter is mainly related to the reply text and the context information. Referring to FIG. 15, after a user identity is created, the speech scene corresponding to the default speech scene parameters associated with that identity is "daily dialogue". During the voice dialogue, the terminal determines the speech scene parameters used in the current speech synthesis from the user identity, the reply text, and the context information together. For example, if the terminal determines that the reply text and the context information do not specify a speech scene, or that the specified scene is consistent with the user's default scene, the terminal applies the user's default speech scene parameters to the final speech synthesis; for instance, if the user's default speech scene is "daily dialogue" and the terminal determines that the speech synthesis of the current reply text specifies no scene, the terminal still applies "daily dialogue" to the synthesis of the final speech. If the terminal determines that the reply text and the context information require a specified speech scene that differs from the user's default, the terminal automatically adjusts the current speech scene to the specified one; for instance, if the user's default speech scene is "daily dialogue" but the terminal determines that the speech synthesis of the current reply text requires the "poetry recitation" scene, the terminal applies the speech scene parameters corresponding to "poetry recitation" to the final speech synthesis.
In a specific embodiment, the terminal may update the default speech scene parameters associated with a user identity based on that user's speaking intention. As shown in FIG. 16, the terminal may, according to the speaking intention of "xiaoming_dad", change the speech scene corresponding to the default speech scene parameters of "xiaoming_dad" from "daily dialogue" to "poetry recitation". It can be understood that the table shown in FIG. 16 is merely an example and not a limitation.
It should be noted that the speech scene parameters for "poetry recitation", "song humming (for example nursery-rhyme humming)", and similar scenes are described in detail later in this document and are not repeated here.
In addition, to better implement this step, in one possible implementation, after the intent recognition module determines a TTS parameter setting intent, the PM module performs the specific update operation. The procedure may be implemented as follows. The PM module maintains a mapping table from parameter-update intents to specific operation interfaces, so that the corresponding operation API is determined from the ID of the currently recognized intent. For example, for the intent of increasing the volume, it calls the Update-Costomized-TTS-Parameters-volume interface, whose inputs are the user ID and the adjustment amplitude; for the intent of correcting the pronunciation of a symbol, it calls the Update-Costomized-TTS-Parameters-pron interface, whose inputs are the user ID, the symbol whose pronunciation is to be corrected, and the target pronunciation string; and so on. If the current user is a registered user, the PM module executes the relevant update interface and carries out the TTS parameter update process described above. If the current user is an unregistered user, the PM module may add a new user information record for this unknown user, with all associated TTS parameters set to their default values, and then update the associated TTS parameters.
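One possible way for the PM module to keep this intent-to-interface mapping is a simple dispatch table. In the sketch below the Python handler names mirror the interfaces mentioned above, but their signatures, the intent IDs, and the slot names are assumptions made for illustration:

```python
# Sketch of the PM module's mapping from parameter-update intents to
# operation interfaces.

def update_customized_tts_parameters_volume(user_id, delta):
    print(f"volume change for {user_id} adjusted by {delta:+.0%}")

def update_customized_tts_parameters_pron(user_id, symbol, target_pron):
    print(f"pronunciation of '{symbol}' for {user_id} set to '{target_pron}'")

INTENT_DISPATCH = {
    "increase_volume": lambda uid, slots:
        update_customized_tts_parameters_volume(uid, slots["delta"]),
    "correct_pronunciation": lambda uid, slots:
        update_customized_tts_parameters_pron(uid, slots["symbol"], slots["target"]),
}

DEFAULT_RECORD = {"rate": 0.0, "volume": 0.0, "pitch": 0.0}
user_records = {"xiaoming": dict(DEFAULT_RECORD)}   # registered users

def handle_update_intent(intent_id, user_id, slots):
    """If the user is unregistered, first create a record holding default
    TTS parameters, then run the interface mapped to the intent ID."""
    if user_id not in user_records:
        user_records[user_id] = dict(DEFAULT_RECORD)
    INTENT_DISPATCH[intent_id](user_id, slots)

handle_update_intent("increase_volume", "xiaoming", {"delta": 0.2})
handle_update_intent("correct_pronunciation", "xiaohua",
                     {"symbol": "奇", "target": "ki1"})
```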
Step 107. The terminal generates a reply text in combination with the context information.
In an embodiment, if the user's speaking intention is to set TTS parameters, the terminal first sets the TTS parameters according to that intention and then generates a reply text, which mainly informs the user that the TTS parameter setting has been completed. For example, if the intention indicated by the current input speech is "increase the speaking rate" or "increase the volume", the preset text corresponding to the setting result may be returned as the reply text, such as "I am speaking a little faster now" or "The volume has been turned up a bit".
In another embodiment, if the user's speaking intention is not to set TTS parameters, the terminal may generate the reply text from the content of the user's utterance together with the context information of the dialogue. For example, if the content of the user's input speech is "What is the weather like today?", the terminal may query local or network resources, or use a dialogue model, to obtain a reply text such as "The weather is fine today; it is sunny". If the content of the input speech is "What is the line before '白云深处有人家'?", the terminal may query local or network resources, or use a dialogue model, to obtain the reply text "The line before '白云深处有人家' is '远上寒山石径斜'", and so on.
In a specific embodiment, the terminal may generate the reply text through the NLG module of the voice response system in combination with the context information held in the DM module. In specific implementations, reply text generation may be retrieval-based, model-based, or the like.
In the retrieval-based approach, a corpus of question-answer pairs is prepared in advance; when a reply is to be generated, the best match to the current question is found in the corpus and its answer is returned as the reply text.
In the model-based approach, a neural network model is trained in advance on a large corpus of question-answer pairs; during reply generation, the question is fed into the model as input, the corresponding answer is computed, and that answer is used as the reply text.
Step 108. The terminal determines the TTS parameters required for the current reply text.
In a specific embodiment, on one hand, the terminal may determine, through the PM module (or a basic TTS parameter management module) of the voice response system, the basic TTS parameters associated with the current user identity, such as the parameters corresponding to the preset pitch, preset speaking rate, and preset volume, as well as the pronunciations of target characters (or strings) in the text. On the other hand, the terminal may determine, through the PM module (or an enhanced TTS parameter management module), the corresponding enhanced TTS parameters, such as speech emotion parameters, speech scene parameters, and background sound effects, according to the content of the reply text and the context information.
In a specific embodiment of the present invention, reply text content suitable for superimposing background sound effects may be poetry or lyrics, lines from films or television, or text with emotional polarity. It should be noted that background sound effects are described in detail later and are not repeated here.
Step 109. The terminal selects an acoustic model from a preset acoustic model library according to the current input speech. This step may also be performed before step 108.
Specifically, the terminal is preset with an acoustic model library, which may include multiple acoustic models, for example a general acoustic model and several personalized acoustic models. These acoustic models are all neural network models, which may be trained in advance on different corpora. Each acoustic model has its own preset information, which serves as the basic input information of that model. For example, the preset information of the general acoustic model may include two or more of its preset speaking rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model may include, in addition to two or more of the preset speaking rate, volume, pitch, timbre, intonation, and prosodic rhythm, other personalized information, for example language style features such as catchphrases, ways of responding to specific scenarios, type of wit, personality type, mixed-in slang or dialect, and forms of address for specific people.
In this embodiment of the present invention, the acoustic model can convert the reply text into the reply speech according to its preset information and the change information of that preset information. The change information of the preset information refers to the basic TTS parameters, the enhanced TTS parameters, the mappings between target characters and the user's preferred pronunciations, the background sound effects, and other information selected for the speech synthesis. Speech synthesized by the general acoustic model presents the sound of a normal, generic dialogue scenario, whereas speech synthesized by a personalized acoustic model can present the sound of a "character imitation" dialogue scenario. The implementation of the "character imitation" dialogue scenario is described in detail later.
In a specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the identity of the user, the acoustic model preferred by the user, and selecting that preferred acoustic model from the multiple acoustic models in the library.
In another specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the content of the current input speech, an acoustic model identifier related to the content of the user's input speech. The identifier of an acoustic model uniquely characterizes the sound of that model. For example, an acoustic model identified as "林志玲" (Lin Zhiling) is used to synthesize a Lin Zhiling-style voice, and an acoustic model identified as "小沈阳" (Xiao Shenyang) is used to synthesize a Xiao Shenyang-style voice, and so on. Thus, if the content of the input speech is related to "林志玲", the acoustic model bearing the "林志玲" identifier can be selected.
In yet another specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the identity of the user, a weight value for each of the multiple acoustic models, where the weight values are preset by the user or determined in advance by learning the user's preferences; then performing a weighted superposition of the acoustic models based on these weight values to obtain a combined acoustic model (which may be called a fusion model), and selecting the fusion model.
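A minimal numerical sketch of the weighted superposition follows; it assumes, purely for illustration, that each acoustic model is represented by the acoustic feature frames it predicts for the same reply text (the real models are neural networks, and NumPy arrays stand in for their outputs):

```python
import numpy as np

def fuse_acoustic_models(model_outputs, weights):
    """Weighted superposition of per-model acoustic features (e.g. frames
    of a mel spectrogram predicted for the same reply text). The weights
    may come from explicit user settings or from learned preferences."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize to sum to 1
    stacked = np.stack(model_outputs)              # (n_models, frames, dims)
    return np.tensordot(weights, stacked, axes=1)  # weighted sum over models

# Toy example: two "models" predicting 3 frames of 4-dimensional features.
out_a = np.ones((3, 4)) * 1.0
out_b = np.ones((3, 4)) * 3.0
fused = fuse_acoustic_models([out_a, out_b], weights=[0.25, 0.75])
print(fused[0])   # -> [2.5 2.5 2.5 2.5]
```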
Step 110. The terminal generates a corresponding call instruction according to the reply text and the determined TTS parameters.
In a specific embodiment, the terminal may generate, through the instruction generation module of the voice response system, the call instruction required by the TTS engine according to the reply text, the determined TTS parameters, and other information.
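The exact format of the call instruction is defined by the TTS engine; before the concrete examples in FIG. 17 to FIG. 19, the sketch below shows, purely as an illustration, how such an instruction could be assembled as a structured object (all field names are assumptions):

```python
import json

def build_tts_call_instruction(reply_text, basic_params, enhanced_params,
                               pron_overrides, acoustic_model_id):
    """Bundle everything the TTS engine needs for one synthesis call."""
    return {
        "text": reply_text,
        "acoustic_model": acoustic_model_id,
        "basic": basic_params,             # rate / volume / pitch deltas
        "enhanced": enhanced_params,       # emotion, scene, background sound
        "pronunciations": pron_overrides,  # target string -> pinyin
    }

instruction = build_tts_call_instruction(
    reply_text="The volume has been turned up a bit",
    basic_params={"rate": 0.0, "volume": 0.2, "pitch": 0.0},
    enhanced_params={"emotion": "Neutral", "scene": "daily_dialogue"},
    pron_overrides={},
    acoustic_model_id="general",
)
print(json.dumps(instruction, ensure_ascii=False, indent=2))
```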
For example, referring to FIG. 17, in one application scenario, when the content of the input speech of the user "xiaoming" is "What is the line before '白云深处有人家'?", the reply text generated by the terminal is: the line before "白云深处有人家" is "远上寒山石径斜". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 17 and are not repeated here.
For another example, referring to FIG. 18, in another application scenario, when the input speech of the user "xiaoming" is "Could you speak a little louder?", the generated reply text is "The volume has been turned up a bit". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 18 and are not repeated here.
For another example, referring to FIG. 19, in yet another application scenario, when the input speech of the user "xiaoming_mom" is "You are speaking too slowly; please speak a little faster", the generated reply text is "I am speaking a little faster now". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 19 and are not repeated here.
Step 111. The terminal performs the speech synthesis operation based on the call instruction. Specifically, through the acoustic model, the terminal synthesizes the reply text into the reply speech according to the preset information of the acoustic model, the basic speech synthesis information, and the enhanced speech synthesis information.
In a specific embodiment, the terminal may call, through the TTS engine of the voice response system, the acoustic model determined in step S109 to perform the speech synthesis operation, so that the reply text is synthesized into the reply speech on the basis of the preset information of the acoustic model and the related TTS parameters. The TTS engine may be a system built on a statistical parametric synthesis method, which can take the various TTS parameters fully into account to synthesize speech of different styles.
Step 112. The terminal returns the reply speech to the user.
In a specific application scenario, the terminal may play the reply speech to the user through a loudspeaker. In possible embodiments, the terminal may further display the reply text corresponding to the reply speech on a display screen.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, thereby automatically combining the user's preferences with the dialogue situation to generate reply speech of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, and improving the timeliness of human-machine dialogue. In addition, the terminal allows the user to tune its voice response system in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal better matches the user's interaction preferences and the user's interaction experience is maximized.
To better understand the scheme for updating TTS parameters in the embodiments of the present invention, the following takes updating the customized character pronunciation table as an example and describes in detail the process of correcting the pronunciation of a user-specified target character (for example a character with multiple readings) based on steps S104-S106 of the embodiment in FIG. 10. Referring to FIG. 20, the process includes but is not limited to the following steps:
Step S201. This step is a refinement of step S104 of the embodiment in FIG. 10. In this step, the terminal recognizes that the user's speaking intention is to correct the pronunciation of a target character, for example to correct the reading of one or more characters with multiple pronunciations.
In a specific implementation, suppose the user says "说错了，应该读作xiao3 qian4，而不是xiao3 xi1" ("You said it wrong; it should be read as xiao3 qian4, not xiao3 xi1"). After the terminal performs text analysis on the recognized text through the NLU module, it identifies the keywords "说错了" ("said it wrong") and "应该读" ("should be read"). The intent recognition module then uses these keywords to match the preset sentence template "…念/读/叫/说错了…应该念/读/叫/说…而不是…" ("… was pronounced/read/called/said wrong … should be pronounced/read/called/said … not …"), and thereby determines that the current user's speaking intention is "correct the pronunciation of the target character" (that is, the TTS parameters need to be updated).
Step S202. This step corresponds to step S105 of the embodiment in FIG. 10, that is, the terminal determines whether the user's speaking intention is to update TTS parameters.
Steps S203-S205. These steps correspond to step S106 of the embodiment in FIG. 10, that is, the terminal performs the TTS parameter update operation indicated by the speaking intention. Steps S203-S205 are described in detail as follows:
Step S203. The terminal extracts the misread pronunciation and the target pronunciation.
In a specific implementation, the intent recognition module of the terminal may, based on the matched preset sentence template, mark "xiao3 xi1" as the misread pronunciation and "xiao3 qian4" as the target pronunciation.
Step S204. The terminal determines the target word (that is, the target character to be corrected) according to the misread pronunciation and the context information.
In a specific implementation, the DM module of the terminal may find, in the context information, the dialogue text that the terminal output in the previous round or previous rounds of dialogue, and determine the pronunciation of each word in that text (for example using the acoustic model). For example, the text output by the terminal in the previous round was "很高兴认识你，小茜" ("Nice to meet you, 小茜"), and the terminal determines that its pronunciation is "hen3 gao1 xing4 ren4 shi2 ni3, xiao3 xi1". The DM module then matches the misread pronunciation against the pronunciation string of this output text, and can thereby determine that the Chinese word corresponding to the misread pronunciation "xiao3 xi1" is "小茜"; that is, "小茜" is the target word (the target character to be corrected).
Step S205. The terminal adds the target word and the target pronunciation to the customized character pronunciation table associated with the user's identity.
In a specific embodiment, the terminal adds, through the PM module, the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target-character-pronunciation pair to the customized character pronunciation table associated with the current user identity. It can be understood that, in subsequent human-machine dialogues, whenever the terminal's reply text contains "小茜", the PM module will determine from the records of the customized character pronunciation table that the pronunciation of "小茜" is "xiao3 qian4".
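Putting steps S203-S205 together, a compact sketch of the correction flow might look as follows. The pinyin lookup is reduced to a tiny table that stands in for the acoustic/G2P step, and the alignment is a whole-syllable match; a real system would align at a finer level:

```python
def pinyin_of(text):
    """Stand-in grapheme-to-phoneme step for the previous reply text."""
    table = {"很": "hen3", "高": "gao1", "兴": "xing4", "认": "ren4",
             "识": "shi2", "你": "ni3", "小": "xiao3", "茜": "xi1"}
    return [(ch, table[ch]) for ch in text if ch in table]

def find_target_word(prev_reply, misread_pinyin):
    """Step S204: locate the substring of the previous reply whose default
    pronunciation matches the misread pinyin reported by the user."""
    pairs = pinyin_of(prev_reply)
    syllables = misread_pinyin.split()
    for i in range(len(pairs) - len(syllables) + 1):
        if [p for _, p in pairs[i:i + len(syllables)]] == syllables:
            return "".join(ch for ch, _ in pairs[i:i + len(syllables)])
    return None

custom_pron_table = {}                           # per-user table (step S205)
target = find_target_word("很高兴认识你，小茜", "xiao3 xi1")
custom_pron_table[target] = "xiao3 qian4"
print(target, "->", custom_pron_table[target])   # 小茜 -> xiao3 qian4
```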
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal allows the user to tune its voice response system in real time by voice during a dialogue, correcting the pronunciation of a user-specified target character (for example a character with multiple readings) according to the user's intention and thereby updating the TTS parameters associated with the user's identity and preferences, so that the tuned terminal better matches the user's interaction preferences and the user's interaction experience is maximized.
To better understand the scheme in the embodiments of the present invention for adaptively selecting TTS parameters according to the user or the current dialogue context, the specific implementation of step S108 in the foregoing embodiment of FIG. 10 is described in detail below. Referring to FIG. 21, the process may include the following steps:
Step 301. This step is a refinement of step S103 in the foregoing embodiment of FIG. 10. In this step, the terminal determines whether the identity of the current user is registered (or whether identity verification has passed).
Step 302. If the terminal determines that the identity of the current user is registered, it reads the basic TTS parameters associated with that user.
As shown in FIG. 11, for example, if the current user is "xiaoming_grandma", the basic TTS parameters associated with "xiaoming_grandma" can be found in the TTS parameter library: the change coefficient of the preset speaking rate is -40%, the change coefficient of the preset volume is +40%, and the change coefficient of the preset pitch is +20%.
Step 303. If the terminal determines that the identity of the current user is not registered (or has not passed identity authentication), it obtains the default basic TTS parameters.
For example, the current user is "xiaohua". Since the identity of "xiaohua" has not been registered, it does not exist in the TTS parameter library, so the default values for unregistered users (as shown in FIG. 11, the change coefficients of the preset speaking rate, preset volume, and preset pitch are all 0) are returned as the basic TTS parameters of the current user.
Step 304. The terminal compares the reply text with the customized character pronunciation table associated with the current user and determines whether any characters, words, or symbols in the text match entries in that table; if so, the terminal obtains the target pronunciations of those characters, words, or symbols.
For example, as shown in FIG. 12, if the current user is "xiaoming" and the current reply text contains "小猪佩奇", then, since this string exists in the customized character pronunciation table associated with "xiaoming", the pronunciation of these four characters is marked as the corresponding pronunciation in the table: xiao3 zhu1 pei4 ki1.
Step 305. The terminal obtains, according to the reply text, the speech emotion parameters among the corresponding enhanced TTS parameters from the TTS parameter library.
In a specific embodiment, the DM module may be preset with an emotion recommendation model trained on a large amount of dialogue text carrying emotion labels. The DM module feeds the reply text into the emotion recommendation model and can thus determine the emotion category of the current reply text (for example happy or sad) and its emotion degree (for example mildly happy or moderately happy). The PM module then determines the speech emotion parameters from the emotion parameter correction mapping table of the TTS parameter library according to the emotion recommended by the DM module. For example, if the current reply text is "那太好了" ("That's great") and the emotion recommended by the emotion recommendation model for this reply text is "moderately happy", the PM module obtains the speech emotion parameters corresponding to "moderately happy" from the emotion parameter correction mapping table shown in FIG. 13.
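This two-stage lookup (emotion recommendation, then parameter retrieval) can be sketched as follows. The keyword-based recommender below is only a placeholder for the trained emotion recommendation model, and the parameter values are invented for illustration rather than taken from FIG. 13:

```python
def recommend_emotion(reply_text):
    """Placeholder for the trained emotion recommendation model: map the
    reply text to an (emotion category, emotion degree) pair."""
    if "太好了" in reply_text:
        return ("happy", "medium")
    return ("neutral", None)

# Stand-in for the emotion parameter correction mapping table (FIG. 13).
EMOTION_PARAM_TABLE = {
    ("neutral", None):   {"pitch_scale": 1.00, "rate_scale": 1.00},
    ("happy", "low"):    {"pitch_scale": 1.05, "rate_scale": 1.02},
    ("happy", "medium"): {"pitch_scale": 1.10, "rate_scale": 1.05},
}

def speech_emotion_params(reply_text):
    return EMOTION_PARAM_TABLE[recommend_emotion(reply_text)]

print(speech_emotion_params("那太好了"))   # parameters for "moderately happy"
```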
Step 306. The terminal obtains, according to the reply text and the context information, the speech scene parameters among the corresponding enhanced TTS parameters from the TTS parameter library.
In a specific embodiment, the DM module may determine the scene of the current dialogue according to the context information of the current dialogue and the reply text. The PM module may then obtain the speech scene parameters among the corresponding enhanced speech parameters according to the determined dialogue scene. For example, if the current reply text is a specific line of seven-character verse (for example "门泊东吴万里船") and the DM module determines from the dialogue context and the reply text that the current dialogue is an ancient-poetry chain game, the DM module can set the speech scene to "poetry recitation", and the PM module obtains the speech scene parameters corresponding to "poetry recitation" from the scene parameter correction mapping table shown in FIG. 15. For another example, if it is determined from the dialogue context and the reply text that the current scene is a nursery-rhyme scene, the speech scene is set to "song humming", and the PM module obtains the speech scene parameters corresponding to "song humming" from the table shown in FIG. 15. For yet another example, if it is determined from the dialogue context and the reply text that the current scene is a role-playing scene, the speech scene is set to "character imitation", and the PM module obtains the speech scene parameters corresponding to "character imitation" from the table shown in FIG. 15, and so on.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users (such as basic TTS parameters, user-preferred pronunciations of target characters, speech emotion parameters, and speech scene parameters) based on the reply text of the dialogue interaction and the dialogue context information, thereby automatically combining the user's preferences with the dialogue situation to generate reply speech of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, improving the timeliness of human-machine dialogue, and enhancing the user's interaction experience.
To better understand the technical solutions of the embodiments of the present invention, the speech synthesis method of the embodiments is described below using the "poetry recitation" speech scene as an example. Referring to FIG. 22, the method can be described in the following steps:
Step 401. The terminal is preset with "poetry recitation" speech scene parameters.
In a specific embodiment, the TTS parameter library of the terminal is preset with the speech scene parameters of "poetry recitation". The "poetry recitation" speech scene emphasizes the prosodic rhythm of the speech. Its speech scene parameters are used to adjust, for input text conforming to a specific syntactic format, the pause positions and pause durations (that is, the segmentation of the text content), the reading durations of characters or words, and the stress positions, so as to strengthen the prosodic rhythm. Compared with the natural rhythm of ordinary dialogue, the strengthened rhythm carries a clearer and stronger emotional expression; for example, when reciting poems, parallel lines of nursery rhymes, or other text in specific syntactic formats, the strengthened rhythm produces a cadenced, rising-and-falling effect.
In a specific implementation, the "poetry recitation" speech scene parameters may be realized through prosodic rhythm templates. The text content of each specific literary form (or syntactic format) may correspond to one or more prosodic rhythm templates. Each template defines the volume change (that is, how heavily the character is stressed) and the duration change (that is, how long the character is held) of the character at each position in the template, as well as the pause positions and pause durations of the speech (that is, the segmentation of the text content). A prosodic rhythm template can be produced in the following two ways:
One way is to use existing grammatical rules, or grammar and rules established by convention, to obtain the prosodic rhythm template associated with a syntactic format. For example, for the rhythm of a five-character quatrain line (such as "白日依山尽"), the segmentation may follow either a "2 characters-3 characters" or a "2 characters-2 characters-1 character" pattern; the corresponding per-character reading durations may be "short long - short short long" and "short short - short short - long" respectively, and the corresponding per-character stress may be "light heavy - light light heavy" and "light light - light light - heavy" respectively.
The other way is to train and learn from corpora in which voice talent reads with the special prosodic rhythm, and to obtain, within frameworks such as statistics, machine learning, and deep networks, a model covering pause positions, per-character or per-word reading durations, and stress positions. After the model is trained, the text content to which the "poetry recitation" mode is to be applied is input to the model, and the prosodic rhythm template corresponding to that text content is obtained.
Step 402. The terminal determines, from the reply text and the context information, that the speech scene of the current dialogue is the "poetry recitation" speech scene.
In a specific embodiment, the terminal may determine through the DM module that the speech scene of the current dialogue is "poetry recitation". Specifically, the DM module may determine this in the following ways:
In one way, during the dialogue the user intention contained in the user's input speech explicitly indicates that the current dialogue is a "poetry recitation" speech scene. After the DM module, together with the intent recognition module, determines the user intention, it sets the current dialogue to the "poetry recitation" speech scene. For example, if the user's input speech instructs the terminal to recite Tang poetry or to play an ancient-poetry chain game, the terminal automatically sets the current dialogue scene to "poetry recitation" once the user intention is recognized.
In another way, in an ordinary conversation the user has no explicit intention indicating that the current dialogue is "poetry recitation", but the terminal can still judge through the DM module whether the content of the reply text involves one or more specific literary forms such as shi (poems), ci (lyric verse), qu (songs), or fu (rhapsodies), for example five-character or seven-character quatrains or regulated verse, or specific ci or qu tune patterns. In a specific implementation, the DM module may search a locally prestored library or a library on a network server by text search matching, semantic analysis, or similar methods; the library may contain literary materials of all kinds and their corresponding literary forms. The DM module then judges whether the content of the reply text exists in the library; if it does, the current dialogue scene is set to the "poetry recitation" speech scene.
A further way is to prestore the literary-form features of various literary forms (or syntactic formats), such as the number of characters, the number of sentences, and the sequence of per-sentence character counts. The DM module can analyze features of the reply text such as punctuation (pauses), character counts, the number of sentences, and the order of per-sentence character counts, and match a passage of the reply text, or the whole text, against the prestored literary-form features. If the match succeeds, the passage or the whole text that conforms to the prestored features is used as the text of the "poetry recitation" speech scene. For example, the literary-form features of a five-character quatrain are: 4 sentences of 5 characters each, 20 characters in total. Those of a five-character regulated verse are: 8 sentences of 5 characters each, 40 characters in total. Those of a seven-character quatrain are: 4 sentences of 7 characters each, 28 characters in total. As a further example, the literary-form features of the Song ci tune 《如梦令》 are: 7 sentences whose character counts are 6, 6, 5, 6, 2, 2, and 6 respectively. If a passage of the reply text is "窗外群山如黛，教室百无聊赖。台上的老师，讲课语速澎湃。真快，真快，直叫骏马难逮。", the DM module can determine that its literary-form features conform to those of 《如梦令》 and set the current dialogue scene to the "poetry recitation" speech scene.
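The third way, matching the reply text against prestored literary-form features, reduces to comparing the sequence of per-sentence character counts. The following is a minimal sketch with only a few forms listed (the feature store in a real system would be much larger):

```python
import re

# Per-sentence character-count patterns for a few literary forms.
LITERARY_FORMS = {
    "five-character quatrain": [5, 5, 5, 5],
    "seven-character quatrain": [7, 7, 7, 7],
    "ci tune 如梦令": [6, 6, 5, 6, 2, 2, 6],
}

def sentence_lengths(text):
    """Split on Chinese punctuation and count characters per sentence."""
    parts = [p for p in re.split("[，。！？、；]", text) if p]
    return [len(p) for p in parts]

def match_literary_form(text):
    lengths = sentence_lengths(text)
    for name, pattern in LITERARY_FORMS.items():
        if lengths == pattern:
            return name
    return None

sample = "窗外群山如黛，教室百无聊赖。台上的老师，讲课语速澎湃。真快，真快，直叫骏马难逮。"
print(match_literary_form(sample))   # -> ci tune 如梦令
```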
Step 403. The terminal determines the speech scene parameters corresponding to the current "poetry recitation" speech scene.
In a specific embodiment, the terminal determines the speech scene parameters corresponding to the current "poetry recitation" speech scene through the PM module.
In one possible implementation, literary forms (or literary-form features) are associated with prosodic rhythm templates. Once the literary form (or literary-form features) involved in the current reply text has been determined, the PM module can obtain the associated prosodic rhythm template from the TTS parameter library. The template contains the corresponding speech scene parameters (that is, the prosodic rhythm change information); specifically, these parameters include the volume change and duration change of the character at each position in the template, and the pause positions and pause durations of the speech in the text. For example, for the prosodic rhythm template of a five-character quatrain, the speech scene parameters of the template include the specific segmentation method, the reading duration of each character in each sentence, and the stress of each character.
In another possible implementation, the selection of speech scene parameters may also be closely tied to the speech emotion parameters; that is, different emotion categories (such as happy or sad) and different emotion levels (such as mildly happy or moderately happy) may affect the speech scene parameters, in other words affect the concrete parameters of the prosodic rhythm template corresponding to the literary form (or literary-form features). The benefit of this design is that the speech scene is brought closer to the current speech emotion, which helps make the final speech output more vivid and natural.
For example, for one prosodic rhythm template of the five-character quatrain, the standard parameters include: a "2 characters-3 characters" segmentation, per-character reading durations of "short long - short short long", and per-character stress of "light heavy - light light heavy". Under different speech emotion parameters, the final speech rendering of this template will differ, and the differences can lie in the breaks between groups, the pitch, the stress, and so on. Table 1 below shows, for one prosodic rhythm template of the five-character quatrain, the influence of different speech emotions on that template. The speech emotions 1, 2, and 3 listed in Table 1 may denote emotion categories (such as happy, neutral, and sad) or emotion levels (such as mildly happy, moderately happy, and extremely happy). Therefore, for the determined prosodic rhythm template, the PM module can determine the final speech scene parameters from rules like those shown in Table 1 according to the speech emotion parameters of the reply text.
Table 1

Parameter                                            | Speech emotion 1               | Speech emotion 2               | Speech emotion 3
Pause between the 2-character and 3-character groups | 1.1 x standard pause duration  | 1.2 x standard pause duration  | 1.3 x standard pause duration
Degree of stress emphasis                            | 1.05 x volume                  | 1.10 x volume                  | 1.15 x volume
Range of pitch variation                             | 1.2 x pitch standard deviation | 1.4 x pitch standard deviation | 1.6 x pitch standard deviation
It should be noted that, in combining speech emotions with prosodic rhythm templates, the present invention is not limited to the implementation shown in Table 1. In other possible implementations, a deep-learning approach may be used: a support vector machine (SVM) or a deep neural network is trained on a large number of prosodic rhythm templates corresponding to different speech emotions, yielding a trained model. In practical applications, the terminal can then feed the standard prosodic rhythm template corresponding to the reply text, together with the speech emotion parameters of the reply text, into this trained model to obtain the final speech scene parameters.
Step 404. The terminal aligns the content of the reply text with the prosodic rhythm template, to facilitate the subsequent speech synthesis.
In a specific embodiment, when speech synthesis is to be performed, the terminal may align the relevant content of the reply text with the prosodic rhythm template of the "poetry recitation" speech scene. Specifically, the terminal may combine the pronunciations of the relevant content of the reply text in the acoustic model library with the parameters of the prosodic rhythm template, superimposing the template parameters onto these pronunciation segments according to a certain scale.
For example, in an exemplary embodiment, the prosody enhancement parameter is ρ (0 < ρ < 1) and the preset volume of the i-th character in the text content is Vi. If the prosodic rhythm features of this character include stress, with a stress change amount of E1, the final volume of the character is Vi × (1 + E1) × (1 + ρ). As another example, if the base duration of the i-th character in the text is Di and the duration change amount is E2, the final duration of the character is Di × (1 + E2). As yet another example, if a pause is required between the i-th character and the (i+1)-th character, the pause duration changes from 0 s to 0.02 s.
As a further example, referring to FIG. 23, the reply text contains the text "白日依山尽", the first line of a five-character quatrain. If the reply text were synthesized with the general acoustic model alone, the synthesized speech (which may be called the base pronunciation segment) would be "bai2 ri4 yi1 shan1 jin4", with the base duration of every character equal to 0.1 s and the default gap between the base pronunciations of the characters equal to 0. In this embodiment of the present invention, however, the prosodic rhythm template corresponding to the five-character quatrain is adopted when the TTS parameters are selected, so that in the subsequent synthesis of the reply text with the general acoustic model this template is additionally superimposed on the base pronunciation segment. In the finally synthesized speech, as shown in FIG. 23, with respect to reading duration, the durations of several characters in the segment are lengthened to different degrees (for example the duration of "ri4" becomes 0.17 s, that of "shan1" becomes 0.14 s, and that of "jin4" becomes 0.17 s); with respect to segmentation, a pause of 0.02 s appears between "bai2 ri4" and "yi1 shan1 jin4"; and with respect to stress, both "ri4" and "jin4" are emphasized. In other words, after the content of the reply text is aligned with the prosodic rhythm template in this embodiment of the present invention, the speech subsequently synthesized by the TTS module can present the effect of the "poetry recitation" speech scene.
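The alignment step can be illustrated with a short sketch that applies the formulas above to the base pronunciation segment of "白日依山尽". The template values follow the example of FIG. 23; the data structures, ρ value, and per-character change amounts are assumptions chosen so that the numbers work out as described:

```python
# Base pronunciation segment from the general acoustic model:
# (syllable, base duration in seconds, base volume), default gap 0 s.
base_segment = [("bai2", 0.10, 1.0), ("ri4", 0.10, 1.0), ("yi1", 0.10, 1.0),
                ("shan1", 0.10, 1.0), ("jin4", 0.10, 1.0)]

# Prosodic rhythm template for this five-character line: per-character
# duration change E2, stress change E1 (0 = unstressed), and a pause after
# index 1 to realize the "2 characters-3 characters" segmentation.
template = {
    "duration_change": [0.0, 0.7, 0.0, 0.4, 0.7],   # e.g. 0.10 s -> 0.17 s
    "stress_change":   [0.0, 0.2, 0.0, 0.0, 0.2],
    "pause_after":     {1: 0.02},                    # 0.02 s after "ri4"
}

def align_with_template(segment, tpl, rho=0.1):
    """final duration = Di*(1+E2); final volume = Vi*(1+E1)*(1+rho) for
    stressed syllables (E1 > 0), Vi otherwise; insert template pauses."""
    out = []
    for i, (syl, dur, vol) in enumerate(segment):
        e1 = tpl["stress_change"][i]
        e2 = tpl["duration_change"][i]
        new_vol = vol * (1 + e1) * (1 + rho) if e1 > 0 else vol
        out.append((syl, round(dur * (1 + e2), 3), round(new_vol, 3)))
        if i in tpl["pause_after"]:
            out.append(("<pause>", tpl["pause_after"][i], 0.0))
    return out

for item in align_with_template(base_segment, template):
    print(item)
```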
The following describes the speech synthesis method of an embodiment of the present invention by taking the "song humming" speech scene (with nursery-rhyme humming as the example). Referring to FIG. 24, the method may be described by the following steps:
Step 501: The terminal presets the speech scene parameters of "nursery-rhyme humming".
In a specific implementation, the TTS parameter library of the terminal is preset with the speech scene parameters of "nursery-rhyme humming". In music, time is divided into equal basic units, each called a "beat". The value of a beat is expressed by a note value: one beat may be a quarter note (a quarter note per beat), a half note (a half note per beat) or an eighth note (an eighth note per beat). The rhythm of music is generally defined by its meter, for example 4/4 time: in 4/4 time a quarter note is one beat and each measure has four beats, that is, four quarter notes. Presetting the "nursery-rhyme humming" speech scene parameters means presetting the meter types of various nursery rhymes, as well as the manner in which the reply text content to be synthesized in the "nursery-rhyme humming" style is segmented.
In a specific embodiment, for the "nursery-rhyme humming" speech scene, the meter of the nursery rhyme may be determined according to the number of characters between two punctuation marks, or the number of characters in each field after word segmentation. For example, for the nursery-rhyme style reply text "小燕子，穿花衣，年年春天来这里，要问燕子你为啥来，燕子说，这里的春天最美丽" ("Little swallow in your coat of flowers, you come here every spring; ask the swallow why it comes, and the swallow says spring here is the most beautiful"), the reply text may be segmented in the following two ways to determine the best matching meter:
One way is to split the reply text by punctuation marks, that is, the punctuation marks in the reply text are identified, and the numbers of characters in the fields separated by the punctuation marks are 3, 3, 7, 8, 3 and 8 respectively. Fields of 3 characters appear most often, so the meter best matching the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time.
The other way is to split the reply text according to the word-segmentation result, for example "小/燕子/穿/花衣/年年/春天/来/这里/要/问/燕子/你/为啥/来/燕子/说/这里/的/春天/最/美丽". To preserve semantic coherence, the segmentation result may be adjusted by attaching the verbs, adjectives and adverbs that modify a noun to the modified noun and merging them into one word. After this processing, the segmentation becomes "小燕子/穿花衣/年年/春天/来这里/要/问燕子/你为啥/来/燕子说/这里的/春天/最美丽", and the numbers of characters in the resulting fields are 3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2 and 3. Fields of 3 characters again appear most often, so the meter best matching the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time.
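By way of illustration, the punctuation-based variant of this meter selection may be sketched as follows; the punctuation set, the candidate meters and the function name best_matching_meter are illustrative assumptions.

```python
import re
from collections import Counter

def best_matching_meter(reply_text, meters=(3, 4)):
    """Pick the beats-per-measure that best matches the reply text.

    Fields are obtained by splitting on punctuation, the most frequent field
    length is taken, and a candidate meter dividing that length is preferred.
    """
    fields = [f for f in re.split(r"[，。、；,.;!？?]", reply_text) if f]
    lengths = [len(f) for f in fields]
    dominant_len, _ = Counter(lengths).most_common(1)[0]
    for beats in meters:
        if dominant_len % beats == 0:
            return beats
    return meters[0]
```

For the nursery-rhyme text above, the dominant field length is 3, so the function returns 3, consistent with choosing a 3/3 or 3/4 meter.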
Step 502: The terminal determines, from the reply text and the context information, that the speech scene of the current dialogue is the "nursery-rhyme humming" speech scene.
In a specific embodiment, the terminal may determine through the DM module that the speech scene of the current dialogue is the "nursery-rhyme humming" speech scene. Specifically, the DM module may determine this in the following ways:
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "nursery-rhyme humming" speech scene. After the DM module, in combination with the intent recognition module, determines the user intent, it determines that the current dialogue is the "nursery-rhyme humming" speech scene. For example, if the user's input speech instructs the terminal to sing a nursery rhyme, the terminal recognizes the user intent and automatically sets the current dialogue scene to the "nursery-rhyme humming" speech scene.
Another way is that, in an ordinary dialogue, although the user has no explicit intent indicating "nursery-rhyme humming", the terminal may still judge through the DM module whether the content of the reply text involves a nursery rhyme. In a specific implementation, the DM module may search a locally stored nursery-rhyme library, or a nursery-rhyme library on a network server, by text-search matching, semantic analysis or the like; the library may contain the lyrics of various nursery rhymes. The DM module then judges whether the content of the reply text exists in these lyrics, and if so, sets the current dialogue scene to the "nursery-rhyme humming" speech scene.
Step 503: The terminal determines the speech scene parameters corresponding to the current "nursery-rhyme mode".
In a specific embodiment, the terminal determines the speech scene parameters corresponding to the current "nursery-rhyme mode" through the PM module. Specifically, the PM module may determine the text segmentation manner according to the content of the reply text (see the two manners described under step 501 above), segment the reply text in that manner to obtain a segmentation result, and then determine the best matching meter according to the segmentation result.
Step 504: The terminal performs beat alignment on the content of the reply text to facilitate subsequent speech synthesis.
In a specific embodiment, the terminal may align the content of the reply text with the determined meter through the PM module, so that each field of the text blends with the rhythmic pattern of the nursery rhyme. Specifically, the terminal aligns the segmented text fields with the time axis according to the changing pattern of the beats.
For example, if a field in the reply text has 3 characters and the matching meter is 3/3 or 3/4 time, the 3 characters may be aligned with the 3 beats within one measure respectively.
As another example, if the number of characters in a field of the reply text is smaller than the number of beats in a measure, for instance a 2-character field with a 4/4 meter, the text fields adjacent to that field are searched. If the field before it (or after it) also has 2 characters, the two fields may be merged and jointly aligned with the 4 beats of the measure. If the neighbouring fields cannot be merged, or the merged number of characters is still smaller than the number of beats, beat alignment may further be performed in the following ways.
One way is to fill the part where there are fewer characters than beats with silence. Specifically, if the number of characters matched to one measure of music is smaller than the number of beats, it is only necessary to ensure during matching that each character corresponds to the temporal position of one beat, and the remaining part is filled with silence. As shown in (a) of FIG. 25, for the field "小白兔" ("little white rabbit") in the reply text, the matching meter is 4/4, so "小", "白" and "兔" may be aligned with the 1st, 2nd and 3rd beats of the measure respectively, and silence is used to fill the 4th beat. It should be noted that the figure only shows one implementation; in practice the silence may fall on any of the 1st to 4th beats.
Another way is to align with the rhythm by lengthening the duration of a certain character. Specifically, when the number of characters matched to one measure is smaller than the number of beats, the characters can be aligned with the beats by lengthening the pronunciation time of one or several characters. As shown in (b) of FIG. 25, for the field "小白兔" the matching meter is 4/4, so "小" and "白" may be aligned with the 1st and 2nd beats of the measure, and the pronunciation of "兔" is lengthened so that "兔" covers the 3rd and 4th beats. It should be noted that the figure only shows one implementation; in practice the character whose pronunciation is lengthened may be any character in "小白兔".
Yet another way is to lengthen the duration of each character evenly to ensure overall time alignment. Specifically, the pronunciation time of every character in the text field may be extended evenly so that the pronunciation time of the characters is aligned with the beats of the music. As shown in (c) of FIG. 25, for the field "小白兔" the matching meter is 4/4, so the reading time of each character may be lengthened to 4/3 of a beat, which ensures that the whole field is aligned with the measure.
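By way of illustration, the three alignment manners above may be sketched as follows; the function signature, the strategy names and the "&lt;sil&gt;" marker for silence are illustrative assumptions.

```python
def align_field_to_measure(chars, beats, beat_len, strategy="pad_silence"):
    """Align a text field with the beats of one measure.

    chars    : characters of the field, e.g. ["小", "白", "兔"]
    beats    : beats per measure, e.g. 4 for 4/4 time
    beat_len : duration of one beat in seconds
    Returns a list of (unit, duration) pairs; "<sil>" marks filled silence.
    """
    n = len(chars)
    if n >= beats:
        return [(c, beat_len) for c in chars[:beats]]
    if strategy == "pad_silence":          # manner (a): fill missing beats with silence
        return [(c, beat_len) for c in chars] + [("<sil>", beat_len)] * (beats - n)
    if strategy == "stretch_last":         # manner (b): lengthen one character
        timeline = [(c, beat_len) for c in chars[:-1]]
        timeline.append((chars[-1], beat_len * (beats - n + 1)))
        return timeline
    # manner (c): lengthen every character evenly across the measure
    return [(c, beat_len * beats / n) for c in chars]
```

For the field "小白兔" with a 4/4 meter and a beat length of 0.5 s, for instance, the "stretch_last" strategy lengthens "兔" to cover the 3rd and 4th beats, corresponding to (b) of FIG. 25.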
The following describes the speech synthesis method of an embodiment of the present invention by taking an acoustic model used to implement "character imitation" as an example. Referring to FIG. 26, the method may be described by the following steps:
Step 601: The acoustic model library of the terminal is preset with acoustic models for implementing "character imitation".
In a specific embodiment, the acoustic model library of the terminal is preset with various acoustic models (i.e. personalized acoustic models) for implementing "character imitation". A "character imitation" acoustic model can make the synthesized speech carry the voice characteristics of a particular person, so information such as its preset timbre, preset intonation and preset prosodic rhythm differs from that of the general acoustic model. The person imitated by a "character imitation" acoustic model may be a figure the user personally likes, a character in a film or television work, or a combination of several preset voice models and the user's preferences. For example, such an acoustic model may imitate the user's own speaking style, or imitate the speaking characteristics of other people, such as an acoustic model imitating "Lin Zhiling / soft and graceful voice", one imitating "Xiao Shenyang / comic voice", or one imitating "Andy Lau / deep and rich voice", and so on. In addition, in a possible embodiment, what the terminal selects during speech synthesis is not a specific acoustic model in the acoustic model library but a composite model of multiple acoustic models in the library.
In the acoustic model library, besides acoustic models preset with the voice characteristics of particular people, different speech features and different language style features may also be combined according to the user's preferences or needs to form an acoustic model with individual characteristics. The speech features include speaking rate, intonation, prosodic rhythm, timbre and so on. Timbre arises because, besides a fundamental tone, a voice naturally carries many different interwoven frequencies and overtones; this determines different timbres and allows a listener to tell voices apart. The people represented by these different voices may be natural persons (such as the user or a voice model), or animated or virtual characters (such as Doraemon or Luo Tianyi). The language style features include catchphrases (including habitual modal particles), response habits in specific scenarios, intelligence type, personality type, popular expressions or dialect mixed into speech, and forms of address for particular people. That is, for an acoustic model formed by combining different speech features and different language style features according to the user's preferences or needs, the preset information includes not only two or more of the preset speaking rate, preset volume, preset pitch, preset timbre, preset intonation and preset prosodic rhythm, but also the language style features.
These language style features are described in detail below:
A user's catchphrases are sentences the user habitually says, intentionally or not. For example, when surprised, some people prepend a sentence with "有没有搞错啊?" ("Are you kidding me?"), and some people habitually insert uncertain words such as "maybe" or "perhaps" in the middle of sentences. Catchphrases may also include habitual modal particles, such as the comedian Xiao Shenyang's signature particle "嚎", which often appears at the end of his sentences.
The response to a specific scenario refers to the reply a person most commonly gives in a certain scenario or to a certain question. For example, to the question "Where shall we eat?", a given person's scenario-specific reply might be "Whatever"; to the question "What beer would you like?", it might be "Tsingtao", and so on.
The intelligence type is used to distinguish how different groups of people tend to understand different ways of presenting content. Intelligence types further include the following: the linguistic type, who read well, like textual descriptions, enjoy word games and are good at writing poems or stories; the logical-mathematical type, who are rational, good at computation and sensitive to numbers; the musical type, who are sensitive to melody and sound, like music and learn more efficiently with music in the background; the spatial type, who are sensitive to their surroundings, like reading charts and are good at drawing; the kinesthetic type, who are good at using their bodies and like sports and making things by hand; the interpersonal type, who are good at understanding and communicating with others; the introspective type, who like thinking independently and setting their own goals; and the naturalist type, who are interested in the natural creatures of the planet. People of different intelligence types answer the same question differently. For example, to the question "How many stars are there in the sky?", a logical-mathematical person might answer "About 6,974 stars are visible to the naked eye", a linguistic person might answer with the verse "七八个星天外，两三点雨山前" ("Seven or eight stars beyond the sky, two or three drops of rain before the hills"), and a musical person might answer with a song, "天上的星星数不清，最亮的是你" ("The stars in the sky are countless, and the brightest is you", from the song 《双子星》), and so on.
The personality type refers to the different language styles corresponding to people with different personalities. For example, a steady person tends to have a rigorous language style; a lively person, a humorous and witty one; an introverted person, a tactful and implicit one; and so on.
Mixing dialect into speech means that a person likes to mix a regional dialect or a foreign language into their speech, for example saying the Cantonese "唔该" or the English "Thank you" when expressing thanks. Mixing popular expressions into speech means that a person likes to replace particular words with currently popular words or internet slang, for example saying "蓝瘦香菇" instead of "难受" ("feeling awful") when sad.
A form of address for a particular person means using a particular appellation for a specific person, for example the user calling a particular person Wang Xiaoming "Teacher Wang" or "Lao Wang", and so on.
In a specific embodiment of the present invention, the speech response system of the terminal may obtain, through learning, the speech features and language style features associated with the user's identity. In a specific implementation, the user's preferences may be acquired and analysed in advance by feature transfer, that is, the user's needs may be determined from the user's consumption of information in other dimensions, so as to further infer and judge the speech features and language style features the user may like.
For example, the features of the user's favourite songs may be analysed and aggregated: the speaking rate and the strength of the prosodic rhythm of the synthesized speech are determined from the rhythmic strength of the songs; the timbre of the synthesized speech is determined from the voice characteristics of the corresponding singers; and the language style of the synthesized speech is determined from the style of the lyrics. As another example, features of dimensions such as the user's favourite television programmes and social media content may be analysed and aggregated to train a feature transfer model, which is then applied to infer the speech features and language style features the user may like.
In a specific embodiment of the present invention, the speech response system of the terminal may also acquire and analyse the user's preferences through multimodal information, that is, automatically analyse and infer the user's preferences or needs regarding the features of synthesized speech from statistics on the user's facial expressions, attention and operating behaviour. Through multimodal analysis, the user's requirements for synthesized speech can be collected not only before personalized synthesized speech is generated; after it is generated, the user's degree of preference for that speech can also be tracked continuously, and the features of the synthesized speech can be iteratively optimized based on this information.
For example, the user's preference for different voices can be obtained indirectly by analysing the user's emotions upon hearing different synthesized voices; or by analysing the user's attention when hearing different synthesized voices (attention can be obtained from the user's facial expression information, or from EEG or bioelectric signals collected by the user's wearable device); or from the user's operating habits when hearing different synthesized voices (skipping a voice or fast-forwarding through it may indicate that the user does not like it much).
The following separately describes acoustic models with the voice characteristics of particular people, and the composite model (also called the fusion model) obtained by fusing multiple acoustic models.
(1) Regarding acoustic models with the voice characteristics of particular people: compared with ordinary people, the characters (for example Lin Zhiling) or dubbing voices (for example Zhou Xingchi's dubbing) in films, television dramas, cartoons, online videos and other works are more expressive, vivid and entertaining. Moreover, classic lines in many works convey direct and strong emotion. Drawing on people's recognition of the emotions expressed by these characters, dubbing voices or lines, acoustic models with the voice characteristics of specific people can be set up so that the pronunciation characteristics of the synthesized speech match the voice characteristics of those characters, dubbing voices or lines, thereby effectively enhancing the expressiveness and entertainment value of the synthesized speech.
(2) Regarding the composite model obtained by fusing multiple acoustic models: since the acoustic model library contains multiple acoustic models, the user's preferences or needs regarding speech can be obtained in advance, and several of those models can then be fused. For example, an acoustic model imitating "Lin Zhiling / soft and graceful voice" may be fused with one imitating "Xiao Shenyang / comic voice"; or the user's own speech features and language style features, or those of a figure the user likes, may be fused with the voice models corresponding to characters in film and television works (such as the "Lin Zhiling / soft and graceful voice" acoustic model or the "Xiao Shenyang / comic voice" acoustic model), so as to obtain the final acoustic model for subsequent speech synthesis.
A specific model fusion manner is described below. In this manner, the voices of multiple personalized acoustic models in the acoustic model library can be used to realize deep-and-rich, soft-and-graceful, cute, comic and other voice types respectively. After acquiring the user's preferences or needs regarding speech (these preferences or needs are directly associated with the user's identity), the terminal determines the user's preference coefficient for each of the several acoustic models; these preference coefficients represent the weight values of the corresponding acoustic models. The weight value of each acoustic model is either manually preset by the user according to the user's own needs, or automatically determined in advance by the terminal by learning the user's preferences. The terminal may then weight and superimpose the acoustic models based on the weight values, so as to obtain a composite acoustic model by fusion.
Specifically, after acquiring the user's preferences or needs regarding speech, the terminal may, according to the speech features and language style features the user likes, select the features of the one or several dimensions for which the user's preference or need is highest, match them against the voices of the multiple acoustic models so as to determine the user's preference coefficient for the voice of each acoustic model, and finally fuse the voice features of the acoustic models with the corresponding preference coefficients to obtain the final speech scene parameters.
For example, as shown in FIG. 27, the table in FIG. 27 gives, by way of example, the voice features corresponding to various voice types (deep-and-rich, soft-and-graceful, comic); it can be seen that different voice types differ in speaking rate, intonation, prosodic rhythm and timbre. After the terminal has acquired the user's preferences or needs regarding speech, it may also match directly against the voices of the multiple acoustic models according to the user's identity (i.e. the user's preferences or needs are directly bound to the user's identity), and thereby determine that the user's preference coefficients for the deep-and-rich, soft-and-graceful and comic voice types are, for example, 0.2, 0.8 and 0.5 respectively, that is, the weights of these acoustic models are 0.2, 0.8 and 0.5. The speaking rate, intonation, prosodic rhythm, timbre and so on of each voice type are then weighted and superimposed to obtain the final acoustic model (i.e. the fusion model). The speech scene parameters synthesized in this way realize a voice conversion of the acoustic models in speaking rate, intonation, prosodic rhythm and timbre, which helps produce mixed voice effects such as "a wittily speaking Lin Zhiling" or "rap-mode Lin Zhiling".
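By way of illustration, the weighted superposition of the voice features may be sketched as follows; the numeric feature values and the normalization of the weights are illustrative assumptions, since the embodiment only requires that the features be weighted and superimposed.

```python
def fuse_acoustic_models(models, weights):
    """Weighted superposition of acoustic model voice-feature parameters.

    models  : list of dicts of numeric features, e.g.
              {"rate": 1.0, "pitch": 1.2, "rhythm": 0.9, "timbre": 0.8}
    weights : preference coefficients for the corresponding models
    Returns the fused feature dict as a normalized weighted average.
    """
    total = sum(weights)
    return {key: sum(m[key] * w for m, w in zip(models, weights)) / total
            for key in models[0]}

# e.g. weights 0.2, 0.8 and 0.5 for the deep, soft and comic voice types
fused = fuse_acoustic_models(
    [{"rate": 0.9, "pitch": 0.8, "rhythm": 1.0, "timbre": 0.3},
     {"rate": 1.0, "pitch": 1.2, "rhythm": 0.9, "timbre": 0.8},
     {"rate": 1.2, "pitch": 1.1, "rhythm": 1.3, "timbre": 0.6}],
    [0.2, 0.8, 0.5])
```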
The embodiments of the present invention are not limited to obtaining the composite model of multiple acoustic models (the fusion model for short) in the above manner. For example, in a possible embodiment, the final acoustic model may also be formed on the basis of character imitation data that the user actively enters into the TTS parameter library, or of a voice request the user sends to the terminal. For example, in one application scenario, the terminal may provide a graphical user interface or a voice interaction interface through which the user selects the parameters of each speech feature and each language style feature according to personal preference. FIG. 28 shows such a selection interface for speech feature parameters and language style feature parameters. On this selection page the user selects, as the speech features, those corresponding to the acoustic model of the "Lin Zhiling" voice, that is, the parameter values of the sub-parameters "speaking rate, intonation, prosodic rhythm, timbre" of the "Lin Zhiling" acoustic model are taken as the parameter values of the corresponding sub-parameters of the fusion model. The user selects, as the language style features, those corresponding to the acoustic model of the "Xiao Shenyang" voice, that is, the parameter values of the sub-parameters "catchphrases, responses to specific scenarios, intelligence type, personality type, mixed-in dialect/popular expressions" of the "Xiao Shenyang" acoustic model are taken as the parameter values of the corresponding sub-parameters of the fusion model.
For example, the user may send the terminal a text or voice request in advance, "Please speak with Lin Zhiling's voice in Xiao Shenyang's language style". The speech response system of the terminal then parses the user's setting intention as follows: the speaking rate, intonation, prosodic rhythm and timbre among the speech features of the fusion model are set to the corresponding sub-parameter values of the speech features of the "Lin Zhiling" acoustic model, and the catchphrases, responses to specific scenarios, intelligence type, personality type and mixed-in dialect/popular expressions among the language style features of the fusion model are set to the corresponding sub-parameter values of the language style features of the "Xiao Shenyang" acoustic model.
In addition, in a possible embodiment of the present invention, the terminal may also determine the acoustic model preferred by the user according to the user's identity, so that during speech synthesis the terminal can directly select the user's preferred acoustic model from the multiple acoustic models in the acoustic model library.
It should be noted that the acoustic model preferred by the user is not necessarily a personalized acoustic model originally set up in the acoustic model library; it may be an acoustic model obtained by fine-tuning the parameters of a personalized acoustic model according to the user's preferences. For example, the voice features of a personalized acoustic model originally set up in the library include a first speaking rate, a first intonation, a first prosodic rhythm and a first timbre. Through analysis of the user's preferences, or through the user's manual settings, the terminal determines the combination of parameters the user likes best, for example 0.8 times the first speaking rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm and 1.2 times the first feminized timbre, and adjusts these parameters accordingly to obtain a personalized acoustic model meeting the user's needs.
Step 602: The terminal determines, from the user's input speech, that the current dialogue needs to use a "character imitation" acoustic model.
In a specific embodiment, the terminal may determine through the DM module that the current dialogue needs to be set to a "character imitation" scene. Specifically, the DM module may determine this in the following ways:
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "character imitation" scene. After the DM module, in combination with the intent recognition module, determines the user intent, it determines that the current dialogue is a "character imitation" scene. For example, if the user's input speech instructs the terminal to speak with Lin Zhiling's voice, the terminal recognizes the user intent and automatically sets the current dialogue scene to the "character imitation" scene.
Another way is that, in an ordinary dialogue, although the user has no explicit intent indicating "character imitation", the terminal may still judge through the DM module whether the content of the input text corresponding to the user's input speech involves content that can be imitated. In a specific implementation, the DM module may determine reply content suitable for character imitation by full-text matching, keyword matching, semantic similarity matching and the like; such content includes lyrics, sound effects, film lines and cartoon dialogue scripts. Full-text matching means that the input text is identical to a part of the corresponding film or music work; keyword matching means that the input text shares some keywords with the film or music work; semantic similarity matching means that the input text is semantically similar to a part of the film or music work.
For example, the input text is "他已经当过主角了，他讲到白日梦不是错，没有梦想的人才是咸鱼。在为梦想拼搏的这条路上，我努力过了就会有收获那就够了。" ("He has already played the lead; he said daydreaming is not wrong, and a person without dreams is just a salted fish. On the road of fighting for a dream, as long as I have tried hard and gained something, that is enough."). After content matching in the above manners, it is found that "没有梦想的人才是咸鱼" ("a person without dreams is just a salted fish") in the input text is matchable content; the matched content is the line from the film Shaolin Soccer, "做人要是没有理想，和咸鱼有什么区别" ("If a man has no ideals, what is the difference between him and a salted fish?"), spoken in the dubbing voice of the character "Zhou Xingchi". The current dialogue is then set to the "character imitation" scene.
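By way of illustration, this content matching may be sketched as follows; the layout of the line library, the use of a simple string-similarity ratio as a stand-in for keyword and semantic matching, and the 0.5 threshold are all illustrative assumptions.

```python
import difflib

def match_imitation_content(input_text, line_library):
    """Decide whether the input text involves content that can be imitated.

    line_library: list of dicts such as
      {"line": "做人要是没有理想，和咸鱼有什么区别",
       "work": "少林足球", "voice": "周星驰"}
    Full-text containment is tried first; otherwise a fuzzy similarity score
    approximates keyword / semantic matching.
    """
    best = None
    for entry in line_library:
        if entry["line"] in input_text:              # full-text matching
            return entry, 1.0
        score = difflib.SequenceMatcher(None, input_text, entry["line"]).ratio()
        if best is None or score > best[1]:          # crude similarity matching
            best = (entry, score)
    return best if best and best[1] > 0.5 else (None, 0.0)
```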
Step 603: The terminal obtains, from the acoustic model library, the acoustic model corresponding to "character imitation".
In a specific embodiment of the present invention, the terminal may select an acoustic model, or a fusion model, from the acoustic model library according to the user's preferences.
In another specific embodiment of the present invention, the terminal determines, according to the content of the current input speech, a voice model identifier related to that content, and selects from the acoustic model library the acoustic model corresponding to that identifier. For example, if the terminal determines from the input text, the user's preferences or the reply text that the currently synthesized speech needs a "Zhou Xingchi" type of voice, it selects the acoustic model of the "Zhou Xingchi" voice type from the acoustic model library.
In yet another specific embodiment of the present invention, after selecting multiple acoustic models from the acoustic model library according to the user's identity, the terminal determines the weight value (i.e. the preference coefficient) of each of the multiple acoustic models, where the weight value of each acoustic model is preset by the user, or determined in advance according to the user's preferences; the terminal then fuses the acoustic models based on the weight values to obtain a fused acoustic model.
Step 604: The terminal performs subsequent speech synthesis with the selected acoustic model.
For example, if the general acoustic model were used for speech synthesis, then when the user's input speech is "Where shall we eat tonight?", the terminal's originally intended synthesized speech might be "We will eat at XX tonight". In the "character imitation" scene, however, the terminal uses the fusion model of the selected "Lin Zhiling" acoustic model and "Xiao Shenyang" acoustic model, and the finally synthesized speech is "你知道嘛？今晚在XX地方吃饭，嚎" ("You know what? We're eating at XX tonight, hao"). The speech features of the output speech use the relevant parameters of the "Lin Zhiling" acoustic model, reflecting the soft and graceful character of the synthesized speech, while the language style features use the relevant parameters of the "Xiao Shenyang" acoustic model, reflecting its witty and comic character. In other words, the synthesized speech output in this way achieves the effect of "speaking with Lin Zhiling's voice in Xiao Shenyang's language style".
It should be noted that the "poetry recitation", "song humming" and "character imitation" scenes enumerated in the above embodiments of the present invention may be used alone in the speech synthesis process, or used in combination. For example, for a combination of the "poetry recitation" speech scene and the "character imitation" speech scene, suppose the input text is "Read a five-character quatrain with Lin Zhiling's voice in Xiao Shenyang's language style". The terminal selects the fusion model of the "Lin Zhiling" acoustic model and the "Xiao Shenyang" acoustic model from the acoustic model library, adopts the "poetry recitation" speech scene parameters in the TTS parameter library (i.e. the prosodic rhythm template corresponding to five-character quatrains), and after speech synthesis of the reply text finally outputs the speech "那我给你念一首诗呗，《登鹳雀楼》，你知道嘛？白日依山尽，黄河入海流，欲穷千里目，更上一层楼，嚎~" ("Then let me read you a poem, 'Climbing Stork Tower', you know? The white sun sets behind the mountains, the Yellow River flows into the sea; to see a thousand miles further, climb one more storey, hao~"). That is, during synthesis this output speech may use the "character imitation" fusion model shown in FIG. 28, while the part "白日依山尽，黄河入海流，欲穷千里目，更上一层楼" additionally uses a prosodic rhythm template similar to that shown in FIG. 23, thereby both completing real-time voice interaction with the user and meeting the user's personalized needs, improving user experience.
In a specific embodiment of the present invention, after the speech is synthesized, a background sound effect may also be superimposed when the synthesized speech is output, so as to enhance the expressive effect of the various TTS parameters. The following describes the speech synthesis method of an embodiment of the present invention by taking the scenario of superimposing a "background sound effect" on the synthesized speech as an example. Referring to FIG. 29, the method may be described by the following steps:
Step 701: The terminal presets a music library.
In a specific embodiment, a music library is preset in the TTS parameter library of the terminal. The music library includes multiple music files used to provide background sound effects during speech synthesis. A background sound effect specifically refers to a music segment (such as instrumental music or a song) or a sound effect (such as a film/television, game, language or animation sound effect).
Step 702: The terminal determines that the reply text contains content suitable for superimposing background music.
In a specific embodiment, the terminal may determine through the DM module the content suitable for superimposing background music. Such content may be text with emotional polarity, poetry or lyrics, film or television lines, and so on. For example, the terminal may identify emotionally inclined words in a sentence through the DM module, and then determine the emotional state of a phrase, a sentence or the whole reply text through grammatical rule analysis, machine-learning classification and the like. In this process an emotion dictionary may be used to identify the emotionally inclined words. The emotion dictionary is a collection of words, each of which has a clear emotional polarity, and the dictionary also contains the polarity information of those words; for example, the words in the dictionary are labelled with emotional polarity types such as happy, like, sadness, surprise, angry, fear and disgust. In a possible embodiment, the different emotional polarity types may even be further divided into multiple degrees of emotional intensity (for example five grades of intensity).
Step 703: The terminal determines, from the music library, the background sound effect to be superimposed.
In a specific embodiment, the terminal determines, through the PM module, the background sound effect to be superimposed from the TTS parameter library.
For example, the terminal labels the different segments (i.e. sub-segments) of each music file in the music library with emotional polarity categories in advance, for example with types such as happy, like, sadness, surprise, angry, fear and disgust. Assuming the current reply text includes text with emotional polarity, then after the emotional polarity categories of that text are determined in step 702, the terminal searches the music library through the PM module for a music file bearing the corresponding emotional polarity category label. In a possible embodiment, if the emotional polarity types are further divided into multiple degrees of emotional intensity, each sub-segment in the music library is labelled in advance with both an emotional polarity category and an emotional intensity; after the emotional polarity categories and emotional intensities of the text are determined in step 702, a combination of sub-segments bearing the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
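By way of illustration, a simplified selection by label may be sketched as follows; the library layout and the nearest-intensity rule are illustrative assumptions, and the embodiment also contemplates combining several labelled sub-segments rather than picking a single clip.

```python
def pick_background_music(text_tags, music_library):
    """Pick a background sound effect whose labels match the reply text.

    text_tags     : (polarity, intensity) determined for the reply text,
                    e.g. ("happy", 0.6)
    music_library : list of dicts such as
                    {"file": "clip01.wav", "polarity": "happy", "intensity": 0.63}
    Filters by polarity, then takes the clip whose labelled intensity is
    closest to the text's intensity.
    """
    polarity, intensity = text_tags
    candidates = [m for m in music_library if m["polarity"] == polarity]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m["intensity"] - intensity))
```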
For example, if the current reply text includes the content of a poem or verse (shi/ci/qu), the terminal searches the music library through the PM module for instrumental music, songs or music effects related to that content; if such music can be found, the related instrumental music or song is used as the background sound effect to be superimposed. In addition, if an emotional polarity category label has been set in advance for each background sound effect in the music library, then after the emotional polarity category of the poem or verse content in the reply text is determined, a background sound effect bearing the corresponding emotional polarity category label may be found in the music library. In a possible embodiment, if the emotional polarity types are further divided into multiple degrees of emotional intensity, each background sound effect in the music library is labelled in advance with an emotional polarity category and an emotional intensity, and after the emotional polarity category and emotional intensity of the poem or verse content are determined, a background sound effect bearing the corresponding labels is found in the music library.
For another example, if the current reply text includes "character imitation" content, the terminal may search the music library through the PM module for instrumental music, songs or music effects related to the imitated voice model. For example, if the imitated person is the voice model "Xiao Shenyang", songs related to "Xiao Shenyang" (such as the song 《我叫小沈阳》, "My Name Is Xiao Shenyang") may be found in the music library; further, a particular segment of that song may be selected as the final background sound effect according to the dialogue scene or the content of the reply text.
Step 704: The terminal aligns the reply text with the determined background sound effect to facilitate subsequent speech synthesis.
In a specific embodiment, the terminal may split the content of the reply text on which the background sound effect is to be superimposed into different parts (splitting by punctuation or by word segmentation), each part being called a piece of sub-content, and calculate the emotional polarity type and emotional intensity of each piece of sub-content. Then, after the background sound effect matching the content is determined, the content is aligned with the matched background sound effect, that is, the emotional variation of the content is kept basically consistent with the emotional variation of the background sound effect.
For example, referring to FIG. 30, in one application scenario the reply text is "天气不错，国足又赢球了，好开心" ("The weather is nice, the national football team has won again, I'm so happy"), and the whole reply text needs a superimposed background sound effect. The reply text is split into three pieces of sub-content, "天气不错，", "国足又赢球了，" and "好开心"; the emotional polarity category of each part is happy, the emotional intensities are 0.48, 0.60 and 0.55 respectively (represented by the black dots in the lower half of the figure), and the total pronunciation durations of the parts are 0.3 s, 0.5 s and 0.2 s. Through step 703 above, a music file whose emotional polarity category is happy has been preliminarily determined; further, the emotional variation track of this music file can be calculated and aggregated to obtain the emotional intensity of each part of the music. The waveform in FIG. 30 represents a piece of music that can be divided into 15 small fragments of 0.1 s each; based on parameters such as the sound intensity and rhythm of each fragment, the emotional intensity of each fragment is obtained by fixed rules or a classifier. The emotional intensities of the 15 fragments are 0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43, 0.52, 0.33, 0.45, 0.53, 0.44, 0.42 and 0.41 (represented by the black dots in the upper half of the figure). It can be seen that the sub-segment formed by fragments 4, 5 and 6 has a total duration of 0.3 s and a maximum emotional intensity of 0.51 (from fragment 4); the sub-segment formed by fragments 7, 8, 9, 10 and 11 has a total duration of 0.5 s and a maximum emotional intensity of 0.63 (from fragment 7); and the sub-segment formed by fragments 12 and 13 has a total duration of 0.2 s and a maximum emotional intensity of 0.53 (from fragment 12). That is, the emotional variation of these three sub-segments is basically consistent with the emotional variation trend of the three pieces of sub-content of the reply text (the tracks of the two broken lines in the figure are basically consistent), so the music segment formed by these three sub-segments is the background sound effect matching the reply text. Therefore "天气不错，", "国足又赢球了，" and "好开心" of the reply text can be aligned with these three sub-segments respectively, so as to produce the effect of "speech superimposed with a background sound effect" in the subsequent speech synthesis process.
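By way of illustration, the alignment of the text sub-contents with the music fragments may be sketched as follows; the exhaustive search over starting fragments and the absolute-difference distance are illustrative assumptions about how the "basically consistent" trajectories are found.

```python
def align_text_to_music(sub_contents, frag_intensities, frag_len=0.1):
    """Find the music window whose emotional trajectory best matches the text.

    sub_contents     : list of (duration_seconds, text_intensity) tuples,
                       e.g. [(0.3, 0.48), (0.5, 0.60), (0.2, 0.55)]
    frag_intensities : per-fragment emotional intensities of the music clip
    frag_len         : fragment length in seconds (0.1 s in the example)

    For every possible starting fragment, the clip is cut into consecutive
    sub-segments matching the text durations; the maximum intensity of each
    sub-segment forms a trajectory, and the offset minimizing the distance
    to the text trajectory is returned with the sub-segment ranges.
    """
    counts = [round(d / frag_len) for d, _ in sub_contents]
    total = sum(counts)
    best = None
    for start in range(len(frag_intensities) - total + 1):
        pos, traj, ranges = start, [], []
        for n in counts:
            traj.append(max(frag_intensities[pos:pos + n]))
            ranges.append((pos, pos + n))
            pos += n
        dist = sum(abs(m - t) for m, (_, t) in zip(traj, sub_contents))
        if best is None or dist < best[0]:
            best = (dist, start, ranges, traj)
    return best
```

With the durations and fragment intensities of FIG. 30, this sketch selects the window beginning at the 4th fragment, consistent with the alignment described above.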
The system framework, terminal device and related speech synthesis methods of the embodiments of the present invention have been described in detail above. Based on the same inventive concept, hardware devices of the embodiments of the present invention are described below.
Referring to FIG. 31, FIG. 31 is a schematic structural diagram of a speech synthesis device 200 according to an embodiment of the present invention. As shown in FIG. 31, the device 200 may include one or more processors 2011, one or more memories 2012 and an audio circuit 2013. In a specific implementation, the device 200 may further include components such as an input unit 2016 and a display unit 2019, and the processor 2011 may be connected to the memory 2012, the audio circuit 2013, the input unit 2016, the display unit 2019 and other components through a bus. These components are described separately as follows:
处理器2011是设备200的控制中心,利用各种接口和线路连接设备200的各个部件,在可能实施例中,处理器2011还可包括一个或多个处理核心。处理器2011可通过运行或执行存储在存储器2012内的软件程序(指令)和/或模块,以及调用存储在存储器2012内的数据来执行语音合成(比如执行图4或图9实施例中的各种模块的功能以及处理数据),以便于实现设备200与用户之间的实时语音对话。The processor 2011 is a control center of the device 200, and uses various interfaces and lines to connect various components of the device 200. In a possible embodiment, the processor 2011 may further include one or more processing cores. The processor 2011 may perform speech synthesis by running or executing software programs (instructions) and / or modules stored in the memory 2012, and calling data stored in the memory 2012 (such as executing each of the embodiments in FIG. 4 or FIG. 9). Functions of this module and processing data) to facilitate real-time voice conversation between the device 200 and the user.
The memory 2012 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 2012 may further include a memory controller to provide the processor 2011 and the input unit 2016 with access to the memory 2012. The memory 2012 may be specifically configured to store software programs (instructions) and data (related data in the acoustic model library and related data in the TTS parameter library).
The audio circuit 2013 may provide an audio interface between the device 200 and the user, and the audio circuit 2013 may further be connected to a loudspeaker 2014 and a microphone 2015. On the one hand, the microphone 2015 collects the user's sound signal and converts it into an electrical signal, which is received by the audio circuit 2013 and converted into audio data (that is, the user's input speech is formed); the audio data is then transmitted to the processor 2011 for speech processing. On the other hand, after the processor 2011 synthesizes the reply speech based on the user's input speech, the reply speech is transmitted to the audio circuit 2013, which converts the received audio data (the reply speech) into an electrical signal and transmits it to the loudspeaker 2014, where it is converted into a sound signal and output. In this way the reply speech is presented to the user, achieving a real-time voice conversation between the device 200 and the user.
The input unit 2016 may be configured to receive numeric or character information entered by the user and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, the input unit 2016 may include a touch-sensitive surface 2017 and other input devices 2018. The touch-sensitive surface 2017, also referred to as a touch display screen or a touchpad, can collect the user's touch operations on or near it and drive the corresponding connection apparatus according to a preset program. Specifically, the other input devices 2018 may include, but are not limited to, one or more of a physical keyboard, function keys, a trackball, a mouse, a joystick, and the like.
The display unit 2019 may be configured to display information entered by the user or information provided by the device 200 to the user (such as an identifier or text related to the reply speech) and various graphical user interfaces of the device 200, which may be composed of graphics, text, icons, video, and any combination thereof. Specifically, the display unit 2019 may include a display panel 2020; optionally, the display panel 2020 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Although in FIG. 31 the touch-sensitive surface 2017 and the display panel 2020 are two separate components, in some embodiments the touch-sensitive surface 2017 and the display panel 2020 may be integrated to implement the input and output functions. For example, the touch-sensitive surface 2017 may cover the display panel 2020; when the touch-sensitive surface 2017 detects a touch operation on or near it, the operation is passed to the processor 2011 to determine the type of the touch event, and the processor 2011 then provides a corresponding visual output on the display panel 2020 according to the type of the touch event.
Those skilled in the art can understand that the device 200 in the embodiments of the present invention may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components. For example, the device 200 may further include a communication module, a camera, and the like, which are not described here again.
Specifically, the processor 2011 may implement the speech synthesis method of the embodiments of the present invention by running or executing the software programs (instructions) stored in the memory 2012 and calling the data stored in the memory 2012, including: the processor 2011 determines the identity of the user according to the user's current input speech; obtains an acoustic model from the acoustic model library according to the current input speech, where the preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determines basic speech synthesis information from the speech synthesis parameter library according to the identity of the user, where the basic speech synthesis information includes a variation of one or more of the preset speech rate, the preset volume, and the preset pitch; determines a reply text according to the current input speech; determines enhanced speech synthesis information from the speech synthesis parameter library according to the reply text and the context information, where the enhanced speech synthesis information includes a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and performs speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
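A minimal, self-contained sketch of how these steps might be chained is shown below. All function names, parameter dictionaries, and return values are hypothetical placeholders standing in for the voiceprint recognizer, dialogue engine, acoustic model library, and TTS parameter library; they are not the actual interfaces of device 200.

# Illustrative sketch of the synthesis flow run by processor 2011.
# Every name below is a hypothetical placeholder, not the device's real API.

def identify_user(input_speech: str) -> str:
    """Stand-in for voiceprint recognition: map the utterance to a user id."""
    return "user_001"

def select_acoustic_model(input_speech: str) -> dict:
    """Stand-in for choosing a model from the acoustic model library; its preset
    information carries rate, volume, pitch, timbre, intonation, and prosody."""
    return {"rate": 1.0, "volume": 1.0, "pitch": 1.0,
            "timbre": "neutral", "intonation": "flat", "prosody": "default"}

def basic_synthesis_info(user_id: str) -> dict:
    """Per-user variations of rate / volume / pitch from the parameter library."""
    return {"rate": +0.1, "volume": -0.05, "pitch": 0.0}

def generate_reply(input_speech: str) -> str:
    """Stand-in for the dialogue engine that produces the reply text."""
    return "The weather is nice today."

def enhanced_synthesis_info(reply_text: str, context: str) -> dict:
    """Variations of timbre / intonation / prosody chosen from the reply and context."""
    return {"intonation": "cheerful", "prosody": "light"}

def synthesize(reply_text: str, model: dict, basic: dict, enhanced: dict) -> dict:
    """Apply the variations to the model presets; a real device would emit audio here."""
    params = dict(model)
    for key, delta in basic.items():
        params[key] = params[key] + delta   # numeric presets are shifted
    params.update(enhanced)                 # categorical presets are replaced
    return {"text": reply_text, "params": params}

input_speech = "What's the weather like?"
user_id = identify_user(input_speech)
model = select_acoustic_model(input_speech)
basic = basic_synthesis_info(user_id)
reply = generate_reply(input_speech)
enhanced = enhanced_synthesis_info(reply, context=input_speech)
print(synthesize(reply, model, basic, enhanced))

The point of the sketch is the ordering: identity and model selection depend only on the input speech, the basic variations depend on who is speaking, and the enhanced variations depend on what is being replied.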
For the specific implementation process of the speech synthesis method executed by the processor 2011, reference may be made to the foregoing method embodiments, and details are not described here again.
It should be noted that, in a possible implementation, when the modules in the embodiment of FIG. 4 or FIG. 9 are software modules, the memory 2012 may further be configured to store these software modules, and the processor 2011 may run the software programs (instructions) and/or these software modules in the memory 2012 and call the data stored in the memory 2012 to perform speech synthesis.
It should also be noted that FIG. 31 is merely one implementation of the speech synthesis device of the present invention; in a possible embodiment, the processor 2011 and the memory 2012 in the device 200 may also be deployed in an integrated manner.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one network site, computer, server, or data center to another network site, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state drive), or the like.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

Claims (18)

  1. A speech synthesis method, wherein the method comprises:
    determining the identity of a user according to the user's current input speech;
    obtaining an acoustic model from a preset acoustic model library according to the current input speech, wherein preset information of the acoustic model comprises a plurality of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm;
    determining basic speech synthesis information according to the identity of the user, wherein the basic speech synthesis information comprises a variation of one or more of the preset speech rate, the preset volume, and the preset pitch;
    determining a reply text according to the current input speech;
    determining enhanced speech synthesis information according to the reply text and context information of the current input speech, wherein the enhanced speech synthesis information comprises a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and
    performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  2. The method according to claim 1, wherein the determining enhanced speech synthesis information according to the reply text and context information comprises:
    determining a literary style feature of the reply text according to the reply text, wherein the literary style feature comprises one or more of the number of sentences in part or all of the reply text, the number of characters in each sentence, and the ordering of the sentence lengths; and
    selecting a corresponding variation of the preset prosodic rhythm according to the literary style feature of the reply text, wherein there is a correspondence between the literary style feature and the variation of the preset prosodic rhythm, and the variation of the preset prosodic rhythm indicates respective changes of the reading duration, reading pause positions, reading pause times, and stress of characters in part or all of the reply text.
  3. The method according to claim 1 or 2, wherein the preset information of the selected acoustic model further comprises a language style feature, and the language style feature specifically comprises one or more of a catchphrase, a response manner for a specific scenario, a wisdom type, a personality type, interspersed popular expressions or dialect, and an appellation for a specific person.
  4. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library, and the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    determining the user's preferences according to the identity of the user; and
    selecting an acoustic model from the acoustic model library according to the user's preferences.
  5. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library, and each acoustic model has an acoustic model identifier; the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    determining, according to the content of the current input speech, an acoustic model identifier related to the content of the current input speech; and
    selecting, from the acoustic model library, the acoustic model corresponding to the acoustic model identifier.
  6. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library;
    the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    selecting a plurality of acoustic models from the acoustic model library according to the identity of the user;
    determining a weight value of each of the plurality of acoustic models, wherein the weight value of each acoustic model is preset by the user, or the weight value of each acoustic model is determined in advance according to the user's preferences; and
    fusing the acoustic models based on the weight values to obtain a fused acoustic model.
  7. The method according to any one of claims 1 to 6, wherein before the determining the identity of the user according to the user's current input speech, the method further comprises:
    determining a correspondence between a target character and the user's preferred pronunciation according to the user's historical input speech, and associating the correspondence between the target character and the user's preferred pronunciation with the identity of the user;
    correspondingly, the performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information comprises:
    when the target character associated with the identity of the user exists in the reply text, performing speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises: selecting a background sound effect from a preset music library according to the reply text, wherein the background sound effect is music or a sound special effect;
    correspondingly, the performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information comprises:
    performing speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  9. The method according to claim 8, wherein the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; and the emotional intensity identifier is used to indicate a respective degree value of the at least one emotion;
    the selecting a background sound effect from a preset music library according to the reply text comprises:
    splitting the content of the reply text into a plurality of sub-contents, and separately determining the emotional polarity type and the emotional intensity of each sub-content; and
    selecting the best-matching background sound effect from the preset music library according to the emotional polarity type and the emotional intensity of each sub-content;
    wherein the best-matching background sound effect comprises a plurality of sub-segments, each sub-segment has an emotional polarity type identifier and an emotional intensity identifier, the emotional polarity type indicated by the emotional polarity type identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the emotional intensity identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  10. A speech synthesis device, wherein the speech synthesis device comprises:
    a speech recognition module, configured to receive a user's current input speech;
    a speech dialogue module, configured to determine the identity of the user according to the user's current input speech, determine basic speech synthesis information according to the identity of the user, determine a reply text according to the current input speech, and determine enhanced speech synthesis information according to the reply text and context information of the current input speech; and
    a speech synthesis module, configured to obtain an acoustic model from a preset acoustic model library according to the current input speech, wherein preset information of the acoustic model comprises a plurality of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm, and to perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information;
    wherein the basic speech synthesis information comprises a variation of one or more of the preset speech rate, the preset volume, and the preset pitch in the preset information of the acoustic model; and the enhanced speech synthesis information comprises a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm in the preset information of the acoustic model.
  11. The device according to claim 10, wherein the speech dialogue module is specifically configured to:
    determine a literary style feature of the reply text according to the reply text, wherein the literary style feature comprises one or more of the number of sentences in part or all of the reply text, the number of characters in each sentence, and the ordering of the sentence lengths; and
    select a corresponding variation of the preset prosodic rhythm according to the literary style feature of the reply text, wherein there is a correspondence between the literary style feature and the variation of the preset prosodic rhythm, and the variation of the preset prosodic rhythm indicates respective changes of the reading duration, reading pause positions, reading pause times, and stress of characters in part or all of the reply text.
  12. The device according to claim 10 or 11, wherein the preset information of the selected acoustic model further comprises a language style feature, and the language style feature specifically comprises one or more of a catchphrase, a response manner for a specific scenario, a wisdom type, a personality type, interspersed popular expressions or dialect, and an appellation for a specific person.
  13. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and the speech synthesis module is specifically configured to:
    determine the user's preferences according to the identity of the user, and select an acoustic model from the acoustic model library according to the user's preferences.
  14. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and each acoustic model has an acoustic model identifier; the speech synthesis module is specifically configured to:
    determine, according to the content of the current input speech, an acoustic model identifier related to the content of the current input speech, and select, from the acoustic model library, the acoustic model corresponding to the acoustic model identifier.
  15. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and the speech synthesis module is specifically configured to:
    select a plurality of acoustic models from the acoustic model library according to the identity of the user; determine a weight value of each of the plurality of acoustic models, wherein the weight value of each acoustic model is preset by the user, or the weight value of each acoustic model is determined in advance according to the user's preferences; and fuse the acoustic models based on the weight values to obtain a fused acoustic model.
  16. The device according to any one of claims 10 to 15, wherein
    the speech dialogue module is further configured to: before the speech recognition module receives the user's current input speech, determine a correspondence between a target character and the user's preferred pronunciation according to the user's historical input speech, and associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user; and
    the speech synthesis module is specifically configured to: when the target character associated with the identity of the user exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  17. The device according to any one of claims 10 to 16, wherein
    the speech dialogue module is further configured to select a background sound effect from a preset music library according to the reply text, wherein the background sound effect is music or a sound special effect; and
    the speech synthesis module is specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  18. The device according to claim 17, wherein the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; and the emotional intensity identifier is used to indicate a respective degree value of the at least one emotion;
    the speech dialogue module is specifically configured to: split the content of the reply text into a plurality of sub-contents, and separately determine the emotional polarity type and the emotional intensity of each sub-content; and select the best-matching background sound effect from the preset music library according to the emotional polarity type and the emotional intensity of each sub-content;
    wherein the best-matching background sound effect comprises a plurality of sub-segments, each sub-segment has an emotional polarity type identifier and an emotional intensity identifier, the emotional polarity type indicated by the emotional polarity type identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the emotional intensity identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
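As a rough illustration of the weighted model fusion described in claims 6 and 15, the sketch below blends the numeric preset values of two hypothetical acoustic models under user-preference weights. The dictionary representation of a model, the parameter names, and the simple weighted average are assumptions for demonstration only; a real implementation would fuse model parameters rather than a handful of preset values.

# Illustrative weighted fusion of several acoustic models' preset values.
# Model dictionaries and weights are hypothetical.

def fuse_acoustic_models(models, weights):
    """Blend the numeric preset values of several models by normalized weights."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize user-set or preference-derived weights
    fused = {}
    for key in models[0]:
        fused[key] = round(sum(w * m[key] for w, m in zip(norm, models)), 3)
    return fused

# Two hypothetical acoustic models selected for one user, for example a "news anchor"
# voice and a "cartoon character" voice, with preference weights 0.7 and 0.3.
anchor  = {"rate": 1.00, "volume": 0.80, "pitch": 0.95}
cartoon = {"rate": 1.20, "volume": 0.90, "pitch": 1.30}

print(fuse_acoustic_models([anchor, cartoon], [0.7, 0.3]))
# expected output: {'rate': 1.06, 'volume': 0.83, 'pitch': 1.055}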
PCT/CN2019/076552 2018-07-28 2019-02-28 Speech synthesis method and related device WO2020024582A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810857240.1 2018-07-28
CN201810857240.1A CN108962217B (en) 2018-07-28 2018-07-28 Speech synthesis method and related equipment

Publications (1)

Publication Number Publication Date
WO2020024582A1 true WO2020024582A1 (en) 2020-02-06

Family

ID=64466758

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076552 WO2020024582A1 (en) 2018-07-28 2019-02-28 Speech synthesis method and related device

Country Status (2)

Country Link
CN (1) CN108962217B (en)
WO (1) WO2020024582A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022043712A1 (en) * 2020-08-28 2022-03-03 Sonantic Limited A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system
EP4116839A4 (en) * 2020-03-27 2023-03-22 Huawei Technologies Co., Ltd. Voice interaction method and electronic device
EP4102397A4 (en) * 2020-02-03 2023-06-28 Huawei Technologies Co., Ltd. Text information processing method and apparatus, computer device, and readable storage medium

Families Citing this family (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962217B (en) * 2018-07-28 2021-07-16 华为技术有限公司 Speech synthesis method and related equipment
CN109461448A (en) * 2018-12-11 2019-03-12 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN109829039B (en) * 2018-12-13 2023-06-09 平安科技(深圳)有限公司 Intelligent chat method, intelligent chat device, computer equipment and storage medium
CN109523986B (en) * 2018-12-20 2022-03-08 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus, device and storage medium
CN109524000A (en) * 2018-12-28 2019-03-26 苏州思必驰信息科技有限公司 Offline implementation method and device
CN111399629B (en) * 2018-12-29 2022-05-03 Tcl科技集团股份有限公司 Operation guiding method of terminal equipment, terminal equipment and storage medium
CN109903748A (en) * 2019-02-14 2019-06-18 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on customized sound bank
CN111627417B (en) * 2019-02-26 2023-08-08 北京地平线机器人技术研发有限公司 Voice playing method and device and electronic equipment
CN109977202A (en) * 2019-03-06 2019-07-05 北京西屋信维科技发展有限公司 A kind of intelligent customer service system and its control method
CN110136688B (en) * 2019-04-15 2023-09-29 平安科技(深圳)有限公司 Text-to-speech method based on speech synthesis and related equipment
CN110060656B (en) * 2019-05-05 2021-12-10 标贝(北京)科技有限公司 Model management and speech synthesis method, device and system and storage medium
CN110211564A (en) * 2019-05-29 2019-09-06 泰康保险集团股份有限公司 Phoneme synthesizing method and device, electronic equipment and computer-readable medium
CN110189742B (en) * 2019-05-30 2021-10-08 芋头科技(杭州)有限公司 Method and related device for determining emotion audio frequency, emotion display and text-to-speech
CN110134250B (en) * 2019-06-21 2022-05-31 易念科技(深圳)有限公司 Human-computer interaction signal processing method, device and computer readable storage medium
CN110197655B (en) * 2019-06-28 2020-12-04 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing speech
CN112331193A (en) * 2019-07-17 2021-02-05 华为技术有限公司 Voice interaction method and related device
CN112242132B (en) * 2019-07-18 2024-06-14 阿里巴巴集团控股有限公司 Data labeling method, device and system in voice synthesis
CN110265021A (en) * 2019-07-22 2019-09-20 深圳前海微众银行股份有限公司 Personalized speech exchange method, robot terminal, device and readable storage medium storing program for executing
CN112417201A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Audio information pushing method and system, electronic equipment and computer readable medium
CN110600001A (en) * 2019-09-09 2019-12-20 大唐网络有限公司 Voice generation method and device
CN110610720B (en) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110782918B (en) * 2019-10-12 2024-02-20 腾讯科技(深圳)有限公司 Speech prosody assessment method and device based on artificial intelligence
CN112765971B (en) * 2019-11-05 2023-11-17 北京火山引擎科技有限公司 Text-to-speech conversion method and device, electronic equipment and storage medium
CN110933330A (en) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 Video dubbing method and device, computer equipment and computer-readable storage medium
CN111031386B (en) * 2019-12-17 2021-07-30 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111081244B (en) * 2019-12-23 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method and device
CN111276122B (en) * 2020-01-14 2023-10-27 广州酷狗计算机科技有限公司 Audio generation method and device and storage medium
CN111292720B (en) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 Speech synthesis method, device, computer readable medium and electronic equipment
CN111241308B (en) * 2020-02-27 2024-04-26 曾兴 Self-help learning method and system for spoken language
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device
CN111862938A (en) * 2020-05-07 2020-10-30 北京嘀嘀无限科技发展有限公司 Intelligent response method, terminal and computer readable storage medium
CN113793590A (en) * 2020-05-26 2021-12-14 华为技术有限公司 Speech synthesis method and device
CN113763920B (en) * 2020-05-29 2023-09-08 广东美的制冷设备有限公司 Air conditioner, voice generating method thereof, voice generating device and readable storage medium
CN111696518A (en) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 Automatic speech synthesis method based on text
CN111768755A (en) * 2020-06-24 2020-10-13 华人运通(上海)云计算科技有限公司 Information processing method, information processing apparatus, vehicle, and computer storage medium
CN111916054B (en) * 2020-07-08 2024-04-26 标贝(青岛)科技有限公司 Lip-based voice generation method, device and system and storage medium
CN113763921B (en) * 2020-07-24 2024-06-18 北京沃东天骏信息技术有限公司 Method and device for correcting text
CN111805558B (en) * 2020-08-03 2021-10-08 深圳作为科技有限公司 Self-learning type elderly nursing robot system with memory recognition function
CN111973178A (en) * 2020-08-14 2020-11-24 中国科学院上海微系统与信息技术研究所 Electroencephalogram signal identification system and method
CN112037793A (en) * 2020-08-21 2020-12-04 北京如影智能科技有限公司 Voice reply method and device
CN112148846A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Reply voice determination method, device, equipment and storage medium combining RPA and AI
CN111968619A (en) * 2020-08-26 2020-11-20 四川长虹电器股份有限公司 Method and device for controlling voice synthesis pronunciation
CN112116905B (en) * 2020-09-16 2023-04-07 珠海格力电器股份有限公司 Method and device for converting memo information into alarm clock to play
CN111930900B (en) * 2020-09-28 2021-09-21 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112349271A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice information processing method and device, electronic equipment and storage medium
CN112382287A (en) * 2020-11-11 2021-02-19 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112071300B (en) * 2020-11-12 2021-04-06 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium
TWI768589B (en) * 2020-12-10 2022-06-21 國立勤益科技大學 Deep learning rhythm practice system
CN112599113B (en) * 2020-12-30 2024-01-30 北京大米科技有限公司 Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN113053373A (en) * 2021-02-26 2021-06-29 上海声通信息科技股份有限公司 Intelligent vehicle-mounted voice interaction system supporting voice cloning
CN113066473A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Voice synthesis method and device, storage medium and electronic equipment
CN113112987B (en) * 2021-04-14 2024-05-03 北京地平线信息技术有限公司 Speech synthesis method, training method and device of speech synthesis model
CN114999438B (en) * 2021-05-08 2023-08-15 中移互联网有限公司 Audio playing method and device
CN112989103A (en) * 2021-05-20 2021-06-18 广州朗国电子科技有限公司 Message playing method, device and storage medium
CN112992118B (en) * 2021-05-22 2021-07-23 成都启英泰伦科技有限公司 Speech model training and synthesizing method with few linguistic data
CN113096638B (en) * 2021-06-09 2021-09-07 北京世纪好未来教育科技有限公司 Speech synthesis model training method, speech synthesis method and device
CN113838451B (en) * 2021-08-17 2022-09-23 北京百度网讯科技有限公司 Voice processing and model training method, device, equipment and storage medium
CN113851106B (en) * 2021-08-17 2023-01-06 北京百度网讯科技有限公司 Audio playing method and device, electronic equipment and readable storage medium
CN113724687B (en) * 2021-08-30 2024-04-16 深圳市神经科学研究院 Speech generation method, device, terminal and storage medium based on brain electrical signals
CN114189587A (en) * 2021-11-10 2022-03-15 阿里巴巴(中国)有限公司 Call method, device, storage medium and computer program product
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
CN114678006B (en) * 2022-05-30 2022-08-23 广东电网有限责任公司佛山供电局 Rhythm-based voice synthesis method and system
CN114822495B (en) * 2022-06-29 2022-10-14 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method
CN117059082B (en) * 2023-10-13 2023-12-29 北京水滴科技集团有限公司 Outbound call conversation method, device, medium and computer equipment based on large model
CN117153162B (en) * 2023-11-01 2024-05-24 北京中电慧声科技有限公司 Voice privacy protection method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402982A (en) * 2010-09-14 2012-04-04 盛乐信息技术(上海)有限公司 Loud reading system with selectable background sounds and realization method of system
JP5112978B2 (en) * 2008-07-30 2013-01-09 Kddi株式会社 Speech recognition apparatus, speech recognition system, and program
CN104485100A (en) * 2014-12-18 2015-04-01 天津讯飞信息科技有限公司 Text-to-speech pronunciation person self-adaptive method and system
CN104766603A (en) * 2014-01-06 2015-07-08 安徽科大讯飞信息科技股份有限公司 Method and device for building personalized singing style spectrum synthesis model
WO2015178600A1 (en) * 2014-05-23 2015-11-26 Samsung Electronics Co., Ltd. Speech recognition method and apparatus using device information
CN105895103A (en) * 2015-12-03 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN106663219A (en) * 2014-04-17 2017-05-10 软银机器人欧洲公司 Methods and systems of handling a dialog with a robot
CN106683667A (en) * 2017-01-13 2017-05-17 深圳爱拼信息科技有限公司 Automatic rhythm extracting method, system and application thereof in natural language processing
CN107644643A (en) * 2017-09-27 2018-01-30 安徽硕威智能科技有限公司 A kind of voice interactive system and method
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000764B (en) * 2006-12-18 2011-05-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
EP2595143B1 (en) * 2011-11-17 2019-04-24 Svox AG Text to speech synthesis for texts with foreign language inclusions
US9911407B2 (en) * 2014-01-14 2018-03-06 Interactive Intelligence Group, Inc. System and method for synthesis of speech from provided text
CN105304080B (en) * 2015-09-22 2019-09-03 科大讯飞股份有限公司 Speech synthetic device and method
CN106550156A (en) * 2017-01-23 2017-03-29 苏州咖啦魔哆信息技术有限公司 A kind of artificial intelligence's customer service system and its implementation based on speech recognition
CN106952648A (en) * 2017-02-17 2017-07-14 北京光年无限科技有限公司 A kind of output intent and robot for robot
CN107731219B (en) * 2017-09-06 2021-07-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device and equipment
CN107767869B (en) * 2017-09-26 2021-03-12 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
CN107705783B (en) * 2017-11-27 2022-04-26 北京搜狗科技发展有限公司 Voice synthesis method and device
CN107993650A (en) * 2017-11-30 2018-05-04 百度在线网络技术(北京)有限公司 Method and apparatus for generating information


Also Published As

Publication number Publication date
CN108962217B (en) 2021-07-16
CN108962217A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
WO2020024582A1 (en) Speech synthesis method and related device
US6721706B1 (en) Environment-responsive user interface/entertainment device that simulates personal interaction
US6795808B1 (en) User interface/entertainment device that simulates personal interaction and charges external database with relevant data
US6731307B1 (en) User interface/entertainment device that simulates personal interaction and responds to user's mental state and/or personality
US6728679B1 (en) Self-updating user interface/entertainment device that simulates personal interaction
US20200395008A1 (en) Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
US20190193273A1 (en) Robots for interactive comedy and companionship
CN108806656B (en) Automatic generation of songs
CN108806655B (en) Automatic generation of songs
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
CN106486121A (en) It is applied to the voice-optimizing method and device of intelligent robot
WO2022242706A1 (en) Multimodal based reactive response generation
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
JP6222465B2 (en) Animation generating apparatus, animation generating method and program
KR20200085433A (en) Voice synthesis system with detachable speaker and method using the same
Narayanan et al. Multimodal systems for children: building a prototype.
Fröhlich Auditory human-computer interaction: An integrated approach
CN117809677A (en) Server, display equipment and digital human interaction method
CN117809680A (en) Server, display equipment and digital human interaction method
CN117809617A (en) Server, display equipment and voice interaction method
CN117809678A (en) Server, display equipment and digital human interaction method
CN117812279A (en) Server, terminal, display equipment and digital human interaction method
CN117111738A (en) Man-machine interaction method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19844016

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19844016

Country of ref document: EP

Kind code of ref document: A1