WO2020024582A1 - Speech synthesis method and related device - Google Patents

Speech synthesis method and related device

Info

Publication number
WO2020024582A1
WO2020024582A1 (application PCT/CN2019/076552)
Authority
WO
WIPO (PCT)
Prior art keywords
user
preset
acoustic model
speech synthesis
voice
Prior art date
Application number
PCT/CN2019/076552
Other languages
French (fr)
Chinese (zh)
Inventor
包飞
邓利群
孙文华
曾毓珑
魏建生
胡月志
黄茂胜
黄雪妍
李志刚
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020024582A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G10L17/22 - Interactive procedures; Man-machine interfaces

Definitions

  • the invention relates to the field of speech processing, in particular to a speech synthesis method and related equipment.
  • human-computer dialogue has begun to widely enter people's daily life. Common scenarios include intelligent customer service robots, smart speakers, chat robots, and so on.
  • the core of human-computer dialogue is that, within the framework of the built system and based on trained or learned data, the machine can automatically understand and analyze the voice input by the user and give a meaningful voice response.
  • in a speech synthesis system for Chinese text, if the input text is simply matched character by character against a pronunciation database and the pronunciations of all characters are concatenated to form the speech output, the resulting speech is mechanically stiff and lacks inflection, giving a poor listening experience.
  • the TTS (text-to-speech) engine developed in recent years is a speech synthesis technology based on reading rules. Using a TTS engine for speech synthesis can handle natural transitions between single characters or words and changes in tone, making the machine's spoken response closer to a human voice.
  • however, such a machine is limited to "speaking like a human" during human-computer interaction and does not consider users' diverse needs for that interaction.
  • the embodiments of the present invention provide a speech synthesis method and related equipment, so that the machine can provide a personalized speech synthesis effect for the user according to user preferences or the requirements of the dialogue environment during human-machine interaction, improve the timeliness of human-machine dialogue, and enhance the user's voice interaction experience.
  • an embodiment of the present invention provides a speech synthesis method that can be applied to a terminal device, including: the terminal device receives a user's current input voice, and determines the identity of the user according to the user's current input voice;
  • an acoustic model is obtained, according to the current input voice, from an acoustic model library preset in the terminal device.
  • the preset information of the acoustic model includes two or more of a preset sound speed, a preset volume, a preset pitch, a preset tone color, a preset intonation, and a preset prosody rhythm; the terminal device determines basic speech synthesis information according to the identity of the user, and the identity of the user is associated with corresponding basic speech synthesis information.
  • the basic speech synthesis information may also be referred to as basic TTS parameters.
  • the basic TTS parameters are used to characterize a change in one or more of the preset sound speed, preset volume, and preset pitch of the acoustic model used in speech synthesis.
  • the terminal device determines a reply text according to the current input voice; the terminal device determines enhanced speech synthesis information according to the reply text, or according to the reply text and context information.
  • the enhanced speech synthesis information described in the embodiments of the present invention may also be referred to as enhanced TTS parameters.
  • the enhanced TTS parameters are used to characterize the amount of change in one or more of the preset tone color, preset intonation, and preset prosody rhythm. In the embodiment of the present invention, the terminal device can determine the dialogue scene of the current dialogue according to the reply text, or according to the reply text and the context information of the current input voice. The terminal device then performs speech synthesis on the reply text through the acoustic model (including the preset information of the acoustic model) according to the basic speech synthesis information and the enhanced speech synthesis information, obtaining the reply voice and thus realizing real-time dialogue interaction between the terminal device and the user. That is, in the embodiment of the present invention, the acoustic model converts the reply text into the reply voice according to its preset information and the change information of that preset information.
  • the acoustic model library may include multiple acoustic models (for example, a general acoustic model, a personalized acoustic model, etc.). These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information; that is, each acoustic model is bound to specific preset information, and this preset information can be used as the basic input information of the acoustic model.
  • the terminal may also determine the basic speech synthesis information according to the personal preference of the user.
  • the context information may represent a context of a current input voice or a historical input voice before the current input voice.
  • in the human-machine voice interaction between the user and the terminal device, the terminal device generates a corresponding reply text based on the user's input voice on the one hand, and on the other hand, based on the reply text of the dialogue interaction and the dialogue context information, selects personalized TTS parameters (the TTS parameters include basic TTS parameters and enhanced TTS parameters) according to the current user's identity, preferences, and dialogue scene; the terminal device can then use the selected TTS parameters, through the selected acoustic model, to generate a reply voice in a specific style, thereby presenting a personalized speech synthesis effect to the user, greatly improving the user's voice interaction experience with the terminal, and improving the timeliness of human-machine dialogue.
  • the terminal device also allows the user to tune the terminal device in real time through voice and to update the TTS parameters associated with the user's identity and preferences, including updating the basic TTS parameters and the enhanced TTS parameters, so that the tuned terminal is closer to the user's interaction preferences and the user's interaction experience is maximized.
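The flow described above can be summarized in a short sketch. This is a minimal illustration only, not the patented implementation: the user IDs, parameter names, and values are invented, and a real system would drive a neural acoustic model rather than return a dictionary.

```python
# Hypothetical per-user basic TTS parameters: change amounts applied to the
# acoustic model's preset sound speed / volume / pitch.
BASIC_TTS_PARAMS = {
    "xiaoming": {"speed": 0.9, "volume": 1.1, "pitch": 1.0},
    "default":  {"speed": 1.0, "volume": 1.0, "pitch": 1.0},
}

# Hypothetical enhanced TTS parameters keyed by dialogue scene.
ENHANCED_TTS_PARAMS = {
    "poem_recitation": {"prosody_template": "jueju_5", "intonation": 1.2},
    "daily_dialogue":  {"prosody_template": None, "intonation": 1.0},
}

def plan_reply_synthesis(user_id, reply_text, scene):
    """Pick basic parameters by user identity and enhanced parameters by scene."""
    basic = BASIC_TTS_PARAMS.get(user_id, BASIC_TTS_PARAMS["default"])
    enhanced = ENHANCED_TTS_PARAMS.get(scene, ENHANCED_TTS_PARAMS["daily_dialogue"])
    # A real system would now call the selected acoustic model; here we return the plan.
    return {"text": reply_text, "basic": basic, "enhanced": enhanced}

print(plan_reply_synthesis("xiaoming", "床前明月光", "poem_recitation"))
```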
  • the enhanced TTS parameters may be further classified into speech emotion parameters and speech scene parameters.
  • the speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics. According to the emotional characteristics, the speech emotion parameters can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness.
  • the speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics.
  • the speech scene parameters can be further divided into parameters such as daily conversation, poetry recitation, song humming, storytelling, and news broadcast; that is to say, using these voice scene parameters in speech synthesis enables the synthesized speech to present the sound effects of voice scenes such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcast.
  • the manner of determining the current scene as a voice scene of "poem recitation" may include:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is a "poem recitation" voice scene;
  • the terminal device can still determine whether the content of the reply text involves a particular literary style such as shi (poetry), ci, qu, or fu;
  • for example, one or more types such as five-character quatrains, seven-character quatrains, or regulated verse, or specific ci or qu tune patterns;
  • the terminal device stores in advance literary style features such as the number of sentences, the number of words in each sentence, and the arrangement order of the number of words per sentence; by analyzing the punctuation (pauses), number of words, number of sentences, and order of the number of words per sentence in the reply text, it matches a paragraph or all of the text of the reply text against the pre-stored literary style features. If the match succeeds, the paragraph or all of the text that conforms to the pre-stored literary style feature can be used as text for the "poem recitation" voice scene.
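As one illustration of this matching, the sketch below checks whether the sentence lengths of the reply text fit a pre-stored literary-style pattern. The style names, patterns, and punctuation-based splitting are assumptions for the example, not the patent's actual feature set.

```python
import re

# Assumed pre-stored literary-style features: number of sentences and the
# character count of each sentence, in order.
LITERARY_STYLES = {
    "five_char_jueju":  [5, 5, 5, 5],   # 4 lines, 5 characters each
    "seven_char_jueju": [7, 7, 7, 7],   # 4 lines, 7 characters each
    "five_char_lvshi":  [5] * 8,        # 8 lines, 5 characters each
}

def detect_literary_style(reply_text):
    # Split on punctuation (pauses) and drop empty segments.
    lines = [s for s in re.split(r"[，。,.!？?；;\n]", reply_text) if s]
    pattern = [len(line) for line in lines]
    for style, expected in LITERARY_STYLES.items():
        if pattern == expected:
            return style        # match: use the "poem recitation" voice scene
    return None                 # no match: keep the normal dialogue scene

print(detect_literary_style("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
```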
  • the "poetry recitation" voice scene focuses on the rhythm of speech; the voice scene parameters of "poetry recitation" are used to adjust the speech pause position / pause time (that is, the word segmentation of the text content), the reading duration of single characters or words, and the stress positions of input text that conforms to a specific literary style (or syntax format), so as to strengthen the prosodic rhythm.
  • the enhanced prosodic rhythm has a clearer and stronger emotional expression; for example, when reading specific poems, nursery rhymes, and other specific syntactic formats, the enhanced prosodic rhythm can produce a cadenced, rising-and-falling effect.
  • the voice scene parameters of "poetry recitation" can be realized through prosodic rhythm templates, and each specific literary style of text content can correspond to a prosodic rhythm template.
  • the literary style characterizes the genre of the poetry; for example, the literary style may be ancient-style poetry, near-style poetry (such as five-character or seven-character quatrains), regulated verse (such as five-character or seven-character regulated verse), ci (such as short, medium, or long ci forms), or qu (including various tunes, tune patterns, etc.). Each prosodic rhythm template defines the volume change of the word at each position in the template (that is, the stress of the word), the change of its sound length (that is, the length of time the word is pronounced), the pause position / pause time of the speech in the text (that is, the word segmentation of the text content), and so on.
  • the process of the terminal determining the enhanced speech synthesis information according to the reply text and context information specifically includes:
  • the literary style feature of the reply text is determined by analyzing the reply text, and the literary style feature includes one or more of the number of sentences, the number of words per sentence, and the arrangement order of the number of words per sentence in the reply text; the corresponding change amount of the preset prosody rhythm is then selected according to the literary style feature involved in the reply text.
  • the change amount of the preset prosody rhythm is the prosody rhythm template, and there is a corresponding relationship between the literary style feature and the prosody rhythm template.
  • the terminal performs rhythmic template alignment on the content of the reply text, so as to facilitate subsequent speech synthesis.
  • the terminal may align the relevant content in the reply text with the rhythmic template of the "poem recitation" voice scene.
  • the terminal may combine the pronunciations from the corresponding acoustic model library for the relevant content in the reply text with the parameters of the prosodic rhythm template, and superimpose the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
  • for example, suppose the prosody enhancement parameter is α (0 ≤ α ≤ 1).
  • the preset volume of the i-th word in the text content is Vi. If the prosodic rhythm feature of the word includes a stress feature whose stress change amount is E1, then the final volume of the word is Vi × (1 + E1) × (1 + α).
  • if the basic sound length of the i-th word in the text is Di and the change amount of its sound length is E2, then the final sound length of the word is Di × (1 + E2).
  • if a pause is required between the i-th word and the (i+1)-th word, the pause time is changed, for example, from 0 s to 0.02 s.
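The formulas above can be applied per word as in the following sketch; the word list, template values, and the value of α are made up for illustration.

```python
def apply_prosody_template(words, template, alpha=0.5):
    """words: dicts with preset 'volume' Vi and 'duration' Di.
    template: per-position dicts with stress change E1, duration change E2, pause (s)."""
    rendered = []
    for word, slot in zip(words, template):
        volume = word["volume"]
        if slot.get("stressed"):
            # Stressed word: Vi * (1 + E1) * (1 + alpha)
            volume *= (1 + slot["E1"]) * (1 + alpha)
        # Duration change: Di * (1 + E2)
        duration = word["duration"] * (1 + slot.get("E2", 0.0))
        rendered.append({"text": word["text"], "volume": volume,
                         "duration": duration, "pause_after": slot.get("pause", 0.0)})
    return rendered

words = [{"text": "床", "volume": 1.0, "duration": 0.25},
         {"text": "前", "volume": 1.0, "duration": 0.25}]
template = [{"stressed": True, "E1": 0.2, "E2": 0.1, "pause": 0.0},
            {"stressed": False, "E2": 0.3, "pause": 0.02}]
print(apply_prosody_template(words, template, alpha=0.5))
```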
  • the acoustic model library may include a general acoustic model and several personalized acoustic models, where:
  • the preset information of the general acoustic model may include the model's preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, preset prosody rhythm, and so on.
  • the speech synthesized by the general acoustic model presents the sound effect of a normal, general dialogue scenario.
  • the preset information of the personalized acoustic model may include voice characteristics and language style characteristics. That is, the preset information of the personalized acoustic model includes two or more of the preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, and preset prosody rhythm; in addition, it can also include other personalized information, such as one or more of the language style characteristics including pet phrases, response styles for specific scenes, wisdom type, personality type, mixing of popular expressions or dialects, and forms of address for specific characters.
  • the speech synthesized by the personalized acoustic model can "simulate" the sound effect of the dialogue scene.
  • the preset sound speed, preset volume, preset pitch, preset tone color, preset intonation, preset prosody rhythm, and other preset information differ between different acoustic models.
  • the preset information of a personalized acoustic model may be significantly different from the preset information of the general acoustic model.
  • the following takes "character imitation" as an example to describe the implementation of an acoustic model related to "character imitation" in speech synthesis.
  • the terminal device may determine, through user input voice, that the current conversation needs to adopt an acoustic model of "character imitation", which specifically includes several methods:
  • after the terminal device determines the user's intention, it further determines that the current conversation is a "character imitation" scene. For example, if the user inputs a voice instructing the terminal to speak with Lin Zhiling's voice, then after the terminal recognizes the user's intention, it automatically sets the current dialogue scene as a "character imitation" scene.
  • the terminal device can still determine whether the content of the input text corresponding to the user's input voice involves the content of character imitation.
  • the reply content that can be imitated by a character can be determined by means of full-text matching, keyword matching, or semantic similarity matching; such content includes lyrics, sound effects, movie lines, and cartoon dialogue scripts.
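A minimal sketch of such matching, assuming a small pre-stored library and using only substring (full-text/keyword) matching; the semantic-similarity matching mentioned above is omitted, and the library entries are invented examples.

```python
# Assumed library of character-related content: lyrics, movie lines, cartoon keywords.
IMITATION_LIBRARY = {
    "zhou_xingchi": ["曾经有一份真诚的爱情放在我面前"],   # illustrative movie line
    "peppa_pig":    ["小猪佩奇", "乔治"],                 # illustrative cartoon keywords
}

def match_imitation_character(reply_text):
    for character, snippets in IMITATION_LIBRARY.items():
        for snippet in snippets:
            # Full-text / keyword match; a real system could fall back to
            # semantic-similarity matching when no literal match is found.
            if snippet in reply_text or reply_text in snippet:
                return character
    return None

print(match_imitation_character("小猪佩奇今天去踩泥坑啦"))
```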
  • the acoustic model library of the terminal device is preset with various acoustic models (that is, personalized acoustic models) for implementing "character imitation".
  • the acoustic model for "character imitation" can be used to make the synthesized speech have the sound characteristics of a specific character; therefore, the preset tone color, preset intonation, and preset prosody rhythm of a "character imitation" acoustic model will differ from the corresponding information of the general acoustic model.
  • the character imitated by a "character imitation" acoustic model may be a personal image the user likes, a character in a film or television work, or a combination of multiple preset sound modes and user preferences.
  • the "character imitation" acoustic model can be an acoustic model that imitates the user's own speaking style; it can also be an acoustic model that imitates the speaking characteristics of other characters, for example, an acoustic model that imitates "Lin Zhiling / soft voice", an acoustic model that imitates "Xiao Shenyang / funny voice", an acoustic model that imitates "Andy Lau / deep voice", and so on.
  • in some cases, the terminal does not select a specific acoustic model in the acoustic model library, but a comprehensive model (also referred to as a fusion model) of multiple acoustic models in the acoustic model library.
  • the terminal may obtain the acoustic model corresponding to "character imitation" from the acoustic model library, and the implementation methods may include the following:
  • the terminal device may select a certain acoustic model or a certain fusion model from the acoustic model library according to the identity of the user. Specifically, since the identity of the user may be associated with the preference of the user, the terminal device may determine the preference of the user according to the identity of the user, and then select a certain acoustic model or a certain fusion model from the acoustic model library according to the preference of the user.
  • the acoustic model preferred by the user is not necessarily the personalized acoustic model originally set in the acoustic model library, but may be an acoustic model obtained by fine-tuning parameters of a personalized acoustic model according to the preference of the user.
  • the sound characteristics of a personalized acoustic model originally set in the acoustic model library include a first speech speed (speed of sound), a first intonation, a first rhythm, and a first tone color.
  • the terminal determines the user's favorite parameter combination through analysis of user preferences or through manual settings by the user: 0.8 times the first speech speed, 1.3 times the first intonation, 0.9 times the first rhythm, and a 1.2-times feminized first tone color; these parameters are adjusted accordingly to obtain a personalized acoustic model that meets the user's needs.
  • the terminal device determines an acoustic mode identifier related to the content of the current input voice according to that content, and selects an acoustic model corresponding to the acoustic mode identifier from the acoustic model library. For example, the terminal may determine, based on the input text, user preference, or reply text, that the currently synthesized speech needs to use a "Zhou Xingchi (Stephen Chow)"-style voice, and then select an acoustic model of that voice type from the acoustic model library.
  • after the terminal device selects multiple acoustic models from the acoustic model library according to the identity of the user, it determines a weight value (that is, a preference coefficient) for each of the multiple acoustic models;
  • the weight values of the acoustic models are set in advance by the user, or the weight values of the respective acoustic models are determined in advance according to the preferences of the user; the respective acoustic models are then fused based on the weight values to obtain a fused acoustic model.
  • the terminal device can also directly match the sounds of multiple acoustic models according to the user's identity (that is, the user's preferences or needs are directly tied to the user's identity), thereby determining the user's preference coefficients for sound types such as deep, soft, cute, and funny; for example, the coefficients for three such models are 0.2, 0.8, and 0.5, that is, the weights of these acoustic models are 0.2, 0.8, and 0.5, respectively.
  • the final acoustic model (i.e., the fusion model), together with the synthesized speech scene parameters, realizes sound conversion of the acoustic model in terms of speech rate, intonation, rhythm, and timbre, which is conducive to producing mixed sound effects such as "talking Lin Zhiling" or "rapper-mode Lin Zhiling".
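As a simplified illustration of weight-based fusion, the sketch below only averages the models' preset parameters by their preference coefficients; fusing the neural acoustic models themselves would be considerably more involved, and the model names, parameters, and weights are assumptions.

```python
# Assumed preset parameters of three personalized acoustic models.
MODELS = {
    "soft_voice": {"speed": 0.9, "intonation": 1.2, "rhythm": 1.0, "timbre_id": 1},
    "cute_voice": {"speed": 1.1, "intonation": 1.3, "rhythm": 1.1, "timbre_id": 2},
    "deep_voice": {"speed": 0.8, "intonation": 0.8, "rhythm": 0.9, "timbre_id": 3},
}

def fuse_models(weights):
    """weights: {model_name: preference coefficient}, e.g. {"soft_voice": 0.8, ...}."""
    total = sum(weights.values())
    fused = {}
    for key in ("speed", "intonation", "rhythm"):
        # Continuous parameters: weighted average by preference coefficient.
        fused[key] = sum(MODELS[m][key] * w for m, w in weights.items()) / total
    # Discrete properties such as timbre: take the highest-weighted model.
    fused["timbre_id"] = MODELS[max(weights, key=weights.get)]["timbre_id"]
    return fused

print(fuse_models({"soft_voice": 0.8, "cute_voice": 0.5, "deep_voice": 0.2}))
```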
  • the TTS parameter further includes a correspondence between a target character and a user's preferred pronunciation.
  • the customized character pronunciation table includes a mapping relationship between a target character and a user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is used to enable the target character involved in the speech synthesized by the acoustic model to have the user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is associated with the identity of the user, that is, different mapping relationships can be organized according to the identity of the user.
  • the customized character pronunciation table can be organized and stored according to user identity.
  • the customized character pronunciation table corresponding to an unregistered user is empty, while the customized character pronunciation table corresponding to a registered user can have entries added, changed, deleted, and so on based on the user's preferences.
  • the object of the setting operation may be a character, a person or place name, a letter, a special symbol, or the like that is easily misread by the terminal or that the user cares about.
  • the customized character pronunciation table includes the mapping relationship between the target character (string) and the user's preferred pronunciation.
  • the target character (string) can be a character (a Chinese character or a foreign-language character), a word, a phrase, a sentence, a number, or a symbol (such as Chinese characters, foreign-language characters, emoticons, punctuation, special symbols, etc.).
  • the terminal device may determine the correspondence between the target character and the user's preferred pronunciation according to the historical input voice of the user, associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user, and write Enter the custom character pronunciation table.
  • for example, the terminal's original acoustic model reads "小猪佩奇" (Peppa Pig) as "xiao3 zhu1 pei4 qi2". If the user has tuned the terminal device through voice in advance and requested that the character "奇" in "小猪佩奇" be pronounced "ki1", the terminal device records the mapping between "小猪佩奇" and "xiao3 zhu1 pei4 ki1" and writes this mapping relationship into the customized character pronunciation table associated with the user "Xiaoming".
  • the terminal device may find the dialogue text output by the terminal in the last round or previous rounds of conversation from the context information, and determine the pronunciation of each word in that dialogue text (for example, using the acoustic model). For example, the output text of the terminal in the last round of conversation was "I'm glad to meet you, Xiao Qian (小茜)", and the terminal determined that its corresponding pronunciation was "hen3, gao1, xing4, ren4, shi2, ni3, xiao3, xi1". The DM module then matches the misread pronunciation against the pronunciation string of the output text and determines that the word corresponding to the misread pronunciation "xiao3 xi1" is "小茜", that is, "小茜" is the target term (the target character to be corrected). Furthermore, the terminal device adds the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target character-pronunciation pair to the customized character pronunciation table associated with the current user identity.
  • when the terminal device finds that a target character associated with the identity of the user exists in the reply text, it performs speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  • for example, when the reply text of the terminal device contains "Xiao Qian", the terminal device determines, according to the record in the customized character pronunciation table, that the pronunciation of "Xiao Qian" is "xiao3 qian4". In this way, the pronunciation of "Xiao Qian" in the reply speech obtained by speech synthesis through the acoustic model is "xiao3 qian4".
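A minimal sketch of how such a per-user pronunciation table might be applied before synthesis; the table entries mirror the examples above, while the toy grapheme-to-phoneme fallback and the function names are invented for illustration.

```python
# Assumed per-user customized character pronunciation tables.
PRONUNCIATION_TABLE = {
    "xiaoming":     {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},
    "current_user": {"小茜": "xiao3 qian4"},
}

def apply_preferred_pronunciations(user_id, reply_text, default_g2p):
    """default_g2p: fallback grapheme-to-phoneme function for unmapped text."""
    overrides = PRONUNCIATION_TABLE.get(user_id, {})
    pronunciation = default_g2p(reply_text)
    for target, preferred in overrides.items():
        if target in reply_text:
            # Swap the default reading of the target string for the preferred one.
            pronunciation = pronunciation.replace(default_g2p(target), preferred)
    return pronunciation

def toy_g2p(text):
    # Toy fallback G2P: not real pinyin, just a deterministic placeholder per character.
    return " ".join("xi1" if ch == "茜" else f"p{ord(ch) % 10}" for ch in text)

print(apply_preferred_pronunciations("current_user", "很高兴认识你小茜", toy_g2p))
```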
  • the TTS parameter further includes a background sound effect
  • the TTS parameter database may include a music library, the music library includes multiple music information, and the music information is used in a speech synthesis process.
  • the background sound effect specifically refers to a certain music segment (such as pure music or song) or sound special effects (such as movie sound effects, game sound effects, language sound effects, animation sound effects, etc.) in the music.
  • the background sound effect is used to superimpose different styles and rhythms of music or sound effects on the speech background synthesized by the acoustic model, thereby enhancing the expression effect of the synthesized speech (such as enhancing the emotional effect).
  • the following describes a method for synthesizing speech in an embodiment of the present invention by using a scene in which a synthesized speech is superimposed with a "background sound effect" as an example.
  • only when the terminal device determines that the reply text has content suitable for superimposing background music does it need to superimpose a background sound effect on the synthesized speech.
  • the terminal device may automatically determine content suitable for superimposing background music.
  • the content suitable for superimposing background music can be words with emotional polarity, can be poetry, can be film and television lines, and so on.
  • the terminal can identify the sentiment-oriented words in the sentence through the DM module, and then determine the emotional state of the phrase, sentence, or the entire reply text through methods such as grammatical rule analysis and machine learning classification. In this process, the emotional dictionary can be used to identify these emotionally inclined words.
  • the emotional dictionary is a collection of words, and the words in the collection have obvious emotional polarity tendencies, and the emotional dictionary also contains the polarity information of these words.
  • the words in the dictionary are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and other emotional polarity types.
  • different types of emotional polarity can be further divided into multiple levels of emotional intensity (for example, divided into five levels of emotional intensity).
  • after determining that there is content suitable for superimposing a background sound effect in the reply text, the terminal determines the background sound effect to be superimposed from the music library. Specifically, the terminal sets, in advance, an emotional polarity category identifier for different segments (i.e., sub-segments) of each music file in the music library; for example, these segments are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and so on. Assuming that the current reply text includes text with emotional polarity, after determining the emotional polarity category of these texts, the terminal device searches the music library for a music file with the corresponding emotional polarity category identifier.
  • further, if an emotional polarity category identifier and an emotional intensity identifier are set in advance for each sub-segment in the music library, then after the emotional polarity category and emotional intensity of these texts are determined, a combination of sub-segments with the corresponding emotional polarity category and emotional intensity identifiers is found in the music library as the finally selected background sound effect.
  • the terminal device selects the most matching background sound effect in the preset music library according to part or all of the reply text.
  • the terminal device can split the content in the reply text that needs to be superimposed with background sound effects into different parts (split according to punctuation); each part can be called a sub-content, and the emotional polarity type and emotional intensity of each sub-content are calculated.
  • align the content with the matched background sound effect so that the emotional change of the content is basically consistent with the emotional change of the background sound effect.
  • the best-matching background sound effect includes a plurality of sub-segments, each of which has an emotional polarity type identifier and an emotional intensity identifier; the emotional polarity type indicated by the identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  • the reply text is "The weather is good, the national football team has won again, so happy.”
  • the entire content of the reply text needs to be superimposed with background sound effects.
  • the reply text is split into the three sub-contents "The weather is good," "the national football team won the game again," and "so happy," and the emotional polarity category of each sub-content is happiness, with different emotional intensities.
  • a music file whose emotional polarity category is happiness is initially determined in the music library. Further, the emotional change trajectory of the music file can be calculated to obtain the emotional intensities of three sub-segments in the music.
  • the emotional change of this fragment is basically consistent with the emotional change trend of the three sub-contents of the reply text, so the music fragment composed of these three sub-segments in this music file is the background sound effect that matches the reply text. Therefore, the three sub-contents "The weather is good," "the national football team won again," and "so happy" in the reply text can be aligned with the three sub-segments.
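The matching logic can be sketched as follows, assuming each sub-content and each music sub-segment has already been reduced to an (emotion, intensity) pair; the library contents and intensity values are invented for illustration.

```python
# Assumed music library: per-file list of (emotion, intensity 1-5) sub-segments.
MUSIC_LIBRARY = {
    "happy_tune.mp3": [("happy", 2), ("happy", 4), ("happy", 5)],
    "sad_tune.mp3":   [("sadness", 2), ("sadness", 3), ("sadness", 4)],
}

def trend(values):
    # Direction of change between consecutive intensities: +1 rising, -1 falling, 0 flat.
    return [1 if b > a else (-1 if b < a else 0) for a, b in zip(values, values[1:])]

def pick_background(sub_contents):
    """sub_contents: list of (emotion, intensity) computed for each part of the reply."""
    emotions = [e for e, _ in sub_contents]
    wanted_trend = trend([i for _, i in sub_contents])
    for name, segments in MUSIC_LIBRARY.items():
        if len(segments) < len(sub_contents):
            continue
        head = segments[:len(sub_contents)]
        if [e for e, _ in head] == emotions and trend([i for _, i in head]) == wanted_trend:
            return name
    return None

# "The weather is good" / "the national football team won again" / "so happy"
print(pick_background([("happy", 2), ("happy", 4), ("happy", 5)]))
```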
  • finally, the terminal device uses the selected acoustic model to perform speech synthesis on the reply text according to the background sound effect (that is, the most-matched music fragment), the basic speech synthesis information, and the enhanced speech synthesis information, and the final reply speech output presents a "speech superimposed on background sound effect" effect.
  • the current dialogue scene may also be a "song humming" voice scene.
  • in this case, the enhanced speech synthesis information used by the terminal device in speech synthesis includes the "song humming (nursery-rhyme humming)" voice scene parameters.
  • the speech synthesis method of the embodiment of the present invention is described below by taking a speech scene of "song humming (taking a nursery rhyme as an example)" as an example.
  • in music, time is divided into equal basic units, and each basic unit is called a "beat".
  • the time value of a beat is expressed by the time value of a note.
  • the time value of a beat can be a quarter note (that is, a quarter note is one beat), a half note (a half note is one beat), or an eighth note (an eighth note is one beat).
  • the rhythm of music is generally defined by the time signature; for example, 4/4 time means that a quarter note is one beat and there are 4 beats per measure, so a measure can contain 4 quarter notes.
  • the voice scene parameters of "children's song humming" are the preset beat types of various children's songs and a method for segmenting the content of the reply text that needs to be synthesized in the "children's song humming" manner.
  • the terminal determines that the voice scene of the current conversation is the "children's song humming" voice scene through the reply text and context information.
  • the user's input voice contains a user's intention to clearly indicate that the current conversation is a "child song humming" voice scene.
  • the terminal can still determine whether the content of the reply text involves the content of children's songs through the DM module.
  • the DM module can search the local pre-stored nursery rhyme library or search the nursery rhyme library in the web server through text search matching or semantic analysis.
  • the lyric library can contain the lyrics of various nursery rhymes, and the DM module judges whether the content of the reply text exists in these nursery-rhyme lyrics; if it does, the current dialogue scene is set to the "children's song humming" voice scene.
  • the terminal device may perform beat alignment on the content of the reply text to facilitate subsequent speech synthesis. Specifically, in a specific embodiment, the terminal may align the content of the reply text with the determined beat through the PM module, so as to ensure that each field of the text is fused with the change rule of the rhythm of the nursery rhyme. Specifically, the terminal aligns the cut text field with the time axis according to the change rule of the beat.
  • the 3 words can be aligned with the 3 beats in a measure.
  • if the number of words in a field of the reply text is less than the number of beats in the measure, for example the field has 2 words and the time signature is 4/4, then adjacent text fields before and after the field are searched. If the field before (or after) this field also has 2 words, this field can be merged with it so that together they align with the 4 beats of the measure. If the fields before and after cannot be merged, or the number of words after merging is still less than the number of beats, the beats can be further aligned in the following ways: one way is to fill the part with fewer words than beats with rests (blanks); another way is to align the rhythm by lengthening the sound length of a particular word; another way is to lengthen the sound length of each word evenly to ensure overall time alignment.
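A minimal sketch of the rest-filling strategy (padding positions with rests where a field has fewer words than beats); the lyric fields and the 4/4 assumption are illustrative only.

```python
def align_to_beats(fields, beats_per_measure=4):
    """Map each text field onto measures; pad short measures with rests."""
    measures = []
    for field in fields:
        chars = list(field)
        for start in range(0, len(chars), beats_per_measure):
            chunk = chars[start:start + beats_per_measure]
            # Fill positions with fewer words than beats with rests (blanks).
            chunk += ["(rest)"] * (beats_per_measure - len(chunk))
            measures.append(chunk)
    return measures

for measure in align_to_beats(["一闪一闪", "亮晶晶", "满天都是小星星"]):
    print(measure)
```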
  • an embodiment of the present invention provides a speech synthesis device.
  • the device includes a processor and a memory coupled to the processor, where:
  • the memory is used to store an acoustic model library and a speech synthesis parameter database (may be referred to as a TTS parameter library).
  • the acoustic model library stores one or more acoustic models
  • the speech synthesis parameter database stores speech synthesis information associated with the identity of the user.
  • the processor is configured to: determine the identity of the user according to the current input voice of the user; obtain an acoustic model from the acoustic model library according to the current input voice, the preset information of the acoustic model including two or more of a preset sound speed, a preset volume, a preset pitch, a preset tone color, a preset intonation, and a preset prosody rhythm; determine basic speech synthesis information from the speech synthesis parameter database according to the identity of the user, the basic speech synthesis information including a change amount of one or more of the preset sound speed, the preset volume, and the preset pitch; determine a reply text according to the current input voice; determine enhanced speech synthesis information from the speech synthesis parameter database according to the reply text, or according to the reply text and the context information of the current input voice, the enhanced speech synthesis information including a change amount of one or more of the preset tone color, the preset intonation, and the preset prosody rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  • in some embodiments, the processor is specifically configured to: determine a literary style feature of the reply text according to the reply text, where the literary style feature includes one or more of the number of sentences, the number of words per sentence, and the arrangement order of the number of words per sentence in part or all of the content of the reply text; and select the corresponding change amount of the preset prosody rhythm from the speech synthesis parameter database according to the literary style feature involved in the reply text.
  • the change amount of the preset prosody rhythm represents changes in the reading duration, reading pause position, reading pause time, and stress of characters in part or all of the content of the reply text.
  • the preset information of the selected acoustic model further includes a language style feature
  • the language style feature specifically includes one or more of a pet phrase, a response mode for a specific scene, a wisdom type, a personality type, mixing of popular expressions or dialects, and forms of address for specific characters.
  • in some embodiments, the processor is specifically configured to: determine the preferences of the user according to the identity of the user; and select an acoustic model from the acoustic model library according to the preferences of the user.
  • in some embodiments, each acoustic model has an acoustic mode identifier; the processor is specifically configured to: determine, according to the content of the current input voice, an acoustic mode identifier related to that content; and select an acoustic model corresponding to the acoustic mode identifier from the acoustic model library.
  • in some embodiments, the processor is specifically configured to: select multiple acoustic models from the acoustic model library according to the identity of the user; determine a weight value for each of the multiple acoustic models, where the weight value of each acoustic model is preset by the user or is determined in advance according to the preferences of the user; and fuse the respective acoustic models based on the weight values to obtain a fused acoustic model.
  • in some embodiments, the processor is further configured to: before determining the identity of the user based on the user's current input voice, determine the correspondence between a target character and the user's preferred pronunciation based on the user's historical input voice, associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user, and save the correspondence to the speech synthesis parameter database;
  • the processor is further specifically configured to: when a target character associated with the identity of the user exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  • in some embodiments, the speech synthesis parameter database further stores a music library; the processor is further configured to select a background sound effect from the music library according to the reply text, the background sound effect being a piece of music or a sound special effect; and the processor is further specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  • in some embodiments, the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, disgust; the emotional intensity identifier is used to indicate the respective intensity value of the at least one emotion; the processor is specifically configured to: split the content of the reply text into a plurality of sub-contents and determine the emotional polarity type and emotional intensity of each sub-content respectively; and select the best-matching background sound effect from the music library according to the emotional polarity types and emotional intensities of the sub-contents;
  • the best-matching background sound effect includes a plurality of sub-segments, each of which has an emotional polarity type identifier and an emotional intensity identifier; the emotional polarity type indicated by the identifier of each sub-segment corresponds to the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  • the device may further include an audio circuit.
  • the audio circuit can provide an audio interface between the device and the user, and the audio circuit can further be connected with a speaker and a microphone.
  • the microphone collects the user's voice signals and converts the collected voice signals into electrical signals, which are received by the audio circuit and converted into audio data (that is, forming the user's input voice); the audio data are then transmitted to the processor for voice processing. On the other hand, the processor synthesizes the reply speech based on the user's input speech and transmits it to the audio circuit.
  • the audio circuit can convert the received audio data (that is, the reply speech) to an electrical signal, and then transmit it to the speaker.
  • the speaker converts the electrical signal into a sound signal and outputs it.
  • an embodiment of the present invention provides a speech synthesis device, which is characterized in that the speech synthesis device includes a speech recognition module, a speech dialogue module, and a speech synthesis module, wherein:
  • a voice recognition module for receiving a user's current input voice
  • a voice dialogue module configured to: determine the identity of the user based on the user's current input voice; determine basic speech synthesis information based on the identity of the user, the basic speech synthesis information including a change amount of one or more of a preset sound speed, a preset volume, and a preset pitch of an acoustic model; determine a reply text based on the current input voice; and determine enhanced speech synthesis information based on the reply text and context information, the enhanced speech synthesis information including a change amount of one or more of a preset tone color, a preset intonation, and a preset prosody rhythm of the acoustic model;
  • a speech synthesis module configured to obtain the acoustic model from a preset acoustic model library according to the current input voice, the preset information of the acoustic model including the preset sound speed, the preset volume, the preset pitch, the preset tone color, the preset intonation, and the preset prosody rhythm, and to perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  • the speech recognition module, speech dialogue module, and speech synthesis module are specifically configured to implement the speech synthesis method described in the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
  • an embodiment of the present invention provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the method described in the first aspect above.
  • the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices in different styles, provide personalized speech synthesis effects to different users, greatly improve the voice interaction experience between the user and the terminal, and improve the timeliness of human-machine dialogue.
  • the terminal also allows the user to tune the terminal's voice response system in real time by voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • FIG. 1 is a schematic diagram of basic physical elements of speech according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of still another system architecture according to an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a system architecture and a terminal device according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a TTS parameter database provided by an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an acoustic model library provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of speech synthesis of a reply text provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of still another system architecture and a terminal device according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention.
  • FIG. 11 is an exemplary diagram of basic TTS parameters associated with a user identity according to an embodiment of the present invention.
  • FIG. 12 is an exemplary diagram of a customized character pronunciation table provided by an embodiment of the present invention.
  • FIG. 13 is an exemplary diagram of an emotional parameter correction mapping table according to an embodiment of the present invention.
  • FIG. 14 is an exemplary diagram of a speech emotion parameter associated with a user identity according to an embodiment of the present invention.
  • FIG. 15 is an exemplary diagram of a scene parameter modification mapping table provided by an embodiment of the present invention.
  • FIG. 16 is an exemplary diagram of a voice scene parameter associated with a user identity according to an embodiment of the present invention.
  • FIGS. 17-19 are exemplary diagrams of calling instructions corresponding to a reply text provided by an embodiment of the present invention;
  • FIG. 20 is a schematic flowchart of a method for updating a customized character pronunciation table according to an embodiment of the present invention;
  • FIG. 21 is a schematic flowchart of a method for determining the TTS parameters required for a current reply text according to an embodiment of the present invention;
  • FIG. 22 is a schematic flowchart of a speech scene-related speech synthesis method of "poem recitation" provided by an embodiment of the present invention
  • FIG. 23 is a schematic diagram of aligning a rhythmic template with content of a reply text according to an embodiment of the present invention.
  • FIG. 24 is a schematic flowchart of a speech scene-related speech synthesis method for a “song humming” according to an embodiment of the present invention.
  • FIG. 25 is a schematic diagram of performing beat alignment on content of a reply text according to an embodiment of the present invention.
  • FIG. 26 is a schematic flowchart of a scene-related speech synthesis method for “character imitation” according to an embodiment of the present invention.
  • FIG. 27 is an exemplary diagram of sound characteristics corresponding to sound characteristics of some specific acoustic models according to an embodiment of the present invention.
  • FIG. 28 is a schematic diagram of an interface for selecting a parameter of a speech feature and a parameter of a language style feature according to an embodiment of the present invention
  • FIG. 29 is a schematic flowchart of a speech synthesis method for a scene with a superimposed background sound effect according to an embodiment of the present invention;
  • FIG. 30 is a schematic diagram of determining a most matching music segment according to an embodiment of the present invention.
  • FIG. 31 is a schematic structural diagram of a hardware device according to an embodiment of the present invention.
  • Speech, which is the sound of language, is the sound-wave form of language as a communication tool; speech realizes language's expressive and social functions.
  • the basic physical elements of speech include sound intensity, sound length, pitch, and tone color. See Figure 1, which are described as follows:
  • the sound intensity may be called volume, tone, stress, and so on.
  • the sound intensity is determined by the amplitude of the sound wave, which is directly proportional to the amplitude of the sound wave, indicating the strength of the sound. Sound intensity has the function of distinguishing the meaning of words and certain grammatical functions in Chinese. For example, sound intensity determines the meaning of soft sound and stress.
  • the sound length indicates the duration of the sound-wave vibration; it is determined by the duration of the vibration of the sounding body: the longer the vibration time, the longer the sound length.
  • sound length can also be expressed through the concept of speech speed: the speed of sound indicates how fast the sound is, so the longer the sound length, the slower the sound speed.
  • Pitch is sometimes also called tone height.
  • the pitch is determined by the frequency of the vibration of the sound wave. The higher the frequency, the higher the pitch.
  • the tone of Chinese characters and the intonation of sentences are mainly determined by the pitch.
  • Timbre may be called sound quality, voice quality, etc.
  • the tone color (timbre) represents the characteristics and nature of the sound; different timbres correspond to different waveform shapes of the sound wave.
  • the timbre is the basic characteristic of a sound that is different from other sounds, and the timbre of different people (or pronunciation bodies) is different.
  • Chinese differs from Western language families in its grammatical structure, grammatical rules, acoustic characteristics, and prosodic structure.
  • Chinese characters are one character and one sound, that is, a syllable is generally a Chinese character.
  • Tones are an integral part of the syllable structure and are usually used to indicate the rise and fall of a syllable's pitch.
  • the formation of tones is mainly determined by changes in pitch, and also involves changes in sound length. During pronunciation, the sounding body can adjust the changes in pitch and length at any time, so that different tones are formed. Tones play an important role in distinguishing meanings.
  • for example, in Chinese speech, tones are used to distinguish the meanings of word pairs such as "theme" and "genre", or "exercise" and "connection".
  • each character has a corresponding fundamental frequency (the frequency of the fundamental tone, which determines the pitch of the character's basic sound), and the fundamental frequencies of adjacent characters may also affect each other, producing fundamental frequency variation (i.e., the phenomenon of sound change).
  • in addition, there are pauses in the pronunciation of consecutive sentences, and different words in a sentence are read lightly or stressed according to the surrounding semantics.
  • the system architecture of the embodiment of the present invention relates to a user and a terminal.
  • the user inputs a voice to the terminal, and the terminal can process the user's voice through a voice response system to obtain a reply voice for the user and present the reply voice to the user.
  • the terminal in the embodiment of the present invention may be a dialogue interactive robot, a home/commercial robot, a smart speaker, a smart table lamp, a smart home appliance, smart furniture, a smart vehicle, or voice assistant / voice dialogue software running on a device such as a mobile phone, a notebook computer, or a tablet computer.
  • the terminal is a robot
  • the user sends a voice to the robot (for example, the user speaks directly to the robot), and the robot replies to the user with a voice (for example, the robot plays the reply voice through a speaker), thereby realizing a human-machine dialogue between the user and the robot.
  • in another example, the terminal is a voice assistant running on a smartphone; the user sends a voice to the voice assistant (for example, the user triggers the voice-assistant icon displayed on the smartphone and then speaks), and the voice assistant presents a reply to the user (for example, by displaying the reply message on the screen and playing the reply voice through the speaker), thereby realizing an interactive dialogue between the user and the voice assistant.
  • the terminal may also be a server.
  • the user sends a voice to a smart phone,
  • the smart phone transmits the voice information to the server,
  • the server obtains a reply voice based on the voice information and returns the reply voice to the smart phone, and
  • the smart phone presents the reply voice to the user (for example, by displaying the reply message on the screen and playing the reply voice through the speaker), thereby realizing an interactive dialogue between the user and the server.
  • FIG. 4 shows a voice response system 10 of a terminal in a system architecture.
  • the voice response system 10 includes a voice recognition module 101, a voice dialog module 102, and a voice synthesis module 103.
  • the functions of each module are described as follows:
  • Speech recognition (Automatic Speech Recognition, ASR) module 101: the ASR module 101 is used to recognize the content of the user's input speech and convert it into text, realizing the conversion from "speech" to "text".
  • Voice dialogue module 102: can be used to generate a reply text based on the recognized text input by the ASR module 101 and transmit the reply text to the speech synthesis module 103; the voice dialogue module 102 is also used to determine the personalized TTS parameters corresponding to the reply text, so that the speech synthesis module 103 can subsequently perform speech synthesis on the reply text based on the relevant TTS parameters.
  • the voice dialog module 102 may specifically include the following modules:
  • Natural Language Understanding (NLU) module 1021: the NLU module 1021 can be used to perform grammatical analysis and semantic analysis on the recognized text input by the ASR module 101, so as to understand the content of the user's speech.
  • Natural Language Generation (NLG) module 1022: the NLG module 1022 can be used to generate a corresponding reply text based on the content of the user's speech and the context information.
  • a dialog management (Dialogue Management, DM) module 1023 is used to track the current session state and control the dialog strategy.
  • User management (UM) module 1024: responsible for user identity confirmation, user information management, and so on. The UM module 1024 can use existing identity recognition techniques (such as voiceprint recognition, face recognition, or even multi-modal biometric recognition) to determine the user's identity.
  • the intent recognition module 1025 can be used to identify the user intention indicated by the user's speaking content.
  • Corpus knowledge related to TTS parameter setting may be added to the intent recognition module 1025, so that the intent recognition module 1025 can identify a user's interaction intention to set (update) one or more TTS parameters.
  • TTS parameter database 1026: used to store basic TTS parameters (or basic speech synthesis information), enhanced TTS parameters (or enhanced speech synthesis information), custom character pronunciation tables, a music library, and so on. This information is described as follows:
  • The basic TTS parameters represent changes in one or more of the preset sound speed, preset volume, and preset pitch of the acoustic model used in synthesizing speech.
  • The basic TTS parameters are associated with a user's identity; that is to say, different basic TTS parameters can be organized according to the identity of the user (or according to the user's preferences).
  • The enhanced TTS parameters represent changes in one or more of the preset timbre, preset intonation, and preset prosodic rhythm of the acoustic model used in synthesizing speech.
  • The enhanced TTS parameters can be further classified into speech emotion parameters and speech scene parameters. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics; according to the emotional characteristics, they can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness. For the specific implementation, refer to the detailed description below.
  • The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics. The speech scene parameters can be further divided into parameters for daily conversation, poetry recitation, song humming, storytelling, news broadcasting, and so on; that is, using these speech scene parameters in speech synthesis makes the synthesized speech exhibit the sound effects of the corresponding voice scene. For the specific implementation, refer to the detailed description below.
  • the customized character pronunciation table includes a mapping relationship between a target character and a user's preferred pronunciation.
  • the target character may be a character (Chinese character or other character), a letter, a number, a symbol, or the like.
  • the mapping relationship between the target character and the user's preferred pronunciation is used to enable the target character involved in the speech synthesized by the acoustic model to have the user's preferred pronunciation.
  • the mapping relationship between the target character and the user's preferred pronunciation is related to the identity of the user, that is, different mapping relationships can be organized according to the identity of the user. For specific implementation, please refer to the detailed description later.
  • the music library includes a plurality of music information, and the music information is used to provide a background sound effect in a speech synthesis process.
  • the background sound effect may be specific music or a sound special effect.
  • the background sound effect is used to superimpose different styles and rhythms of music or sound effects on the speech background synthesized by the acoustic model, thereby enhancing the expression effect of the synthesized speech (such as enhancing the emotional effect).
  • For the specific implementation, refer to the detailed description below.
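  • As a rough illustration only (the text does not prescribe any particular data layout, so all class and field names below are hypothetical), the TTS parameter database 1026 can be sketched as a per-user store of basic TTS parameters and a custom character pronunciation table, plus shared enhanced TTS parameters and a music library:
```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class BasicTTSParams:
    # Changes relative to the selected acoustic model's presets, e.g. -0.4 means -40%.
    speed_delta: float = 0.0
    volume_delta: float = 0.0
    pitch_delta: float = 0.0

@dataclass
class UserTTSProfile:
    basic: BasicTTSParams = field(default_factory=BasicTTSParams)
    # Custom character pronunciation table: target character/string -> preferred pronunciation.
    pronunciation_table: Dict[str, str] = field(default_factory=dict)
    default_emotion: str = "Neutral"
    default_scene: str = "daily_conversation"

class TTSParameterDatabase:
    """Toy stand-in for the TTS parameter database 1026 (not the actual implementation)."""
    def __init__(self) -> None:
        self.users: Dict[str, UserTTSProfile] = {}
        # Enhanced TTS parameters shared by all users (values are placeholders).
        self.emotion_params = {"Neutral": {}, "Happy_low": {"pitch_delta": 0.05}}
        self.scene_params = {"daily_conversation": {}, "poem_recitation": {"pause_scale": 1.3}}
        # Music library: background sound effect name -> audio resource.
        self.music_library = {"sad_piano": "bgm/sad_piano.wav"}

    def profile(self, user_id: str) -> UserTTSProfile:
        # Unregistered users fall back to a default profile (all deltas zero, empty table).
        return self.users.setdefault(user_id, UserTTSProfile())

db = TTSParameterDatabase()
db.profile("xiaoming").basic.volume_delta = 0.20   # e.g. after "turn the volume up a bit"
```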
  • Parameter management (PM) module 1027: used to manage the TTS parameters in the TTS parameter database; the management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more TTS parameters according to the user's intention to set the TTS parameters.
  • the PM module 1027 may be used to determine a basic TTS parameter associated with the user according to the identity of the user, and to determine an enhanced TTS parameter used to enhance the speech synthesis effect according to the content and context information of the reply text.
  • the TTS module 103 is used to convert the reply text generated by the voice dialog module 102 into a reply voice, so as to present the reply voice to the user.
  • the TTS module 103 may specifically include the following modules:
  • Instruction generation module 1031: may be configured to generate or update a calling instruction based on the reply text and the TTS parameters (including basic TTS parameters and enhanced TTS parameters) transmitted from the voice dialogue module 102; the calling instruction is then applied to the TTS engine 1032.
  • TTS engine 1032: used to call an appropriate acoustic model from the acoustic model library 1033 according to the calling instruction generated or updated by the instruction generation module 1031, and to use that acoustic model, together with the basic TTS parameters, the enhanced TTS parameters, the mapping relationship between target characters and the user's preferred pronunciations, background sound effects, and other information, to synthesize the reply text into a reply speech and return the reply speech to the user.
  • the acoustic model library 1033 may include multiple acoustic models, such as a general acoustic model, and several personalized acoustic models, and so on. These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information, that is, each acoustic model is bound to a specific preset information. These preset information can be used as the basic input information of the acoustic model.
  • The preset information of the general acoustic model may include two or more of the model's preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model likewise includes two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, and may additionally include other personalized information such as catchphrases, ways of responding in specific scenarios, type of wisdom, personality type, mixed-in popular expressions or dialects, forms of address for specific persons, and other language-style characteristics. It should be understood that the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information differ between acoustic models; in particular, the preset information of a personalized acoustic model may differ significantly from that of the general acoustic model.
  • the acoustic model can convert the reply text into a reply voice according to the preset information and the change information of the preset information.
  • the change information of the preset information referred to here means information such as a basic TTS parameter, an enhanced TTS parameter, a mapping relationship between a target character and a user's preferred pronunciation, and a background sound effect selected in speech synthesis.
  • The speech synthesized through the general acoustic model presents the sound effect of a normal, generic dialogue scenario, while the speech synthesized through a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario; the implementation of the "character imitation" dialogue scenario will be described in detail later.
  • each module in the embodiment shown in FIG. 4 may be a software module, and these software modules may be stored in the memory of the terminal device, and the processor in the terminal device calls these modules in the memory to Perform a speech synthesis method.
  • the implementation form of each module in the embodiment in FIG. 4 may be a hardware component in a terminal device.
  • the voice response system obtains the user's input voice
  • the response text is obtained through the voice recognition module and the voice dialogue module.
  • The voice dialogue module determines, based on the current user identity, the basic TTS parameters associated with that identity from the TTS parameter database; determines, based on the reply text and the context information, the enhanced TTS parameters and the background sound effect from the TTS parameter database; and, if the reply text contains a target character associated with the user identity, also determines the user's preferred pronunciation corresponding to that target character.
  • The speech synthesis module calls an appropriate acoustic model from the acoustic model library based on the user's input speech, the user's preference (which is associated with the user's identity), or the reply text, and combines one or more of the TTS parameters (basic TTS parameters, enhanced TTS parameters, the mapping relationship between target characters and the user's preferred pronunciations, and the background sound effect) to perform speech synthesis, generating a reply speech to be presented to the user.
  • FIG. 8 shows a speech synthesis process in an application scenario.
  • For example, the voice response system obtains the user's input voice, and the reply text obtained through the voice recognition module and the voice dialogue module is "the weather is very good today".
  • The voice dialogue module determines the basic TTS parameters associated with the user's identity, determines enhanced TTS parameters such as the speech emotion parameters and speech scene parameters based on the content and context information of the reply text, and determines the background sound effect based on the content of the reply text.
  • On this basis, the speech synthesis module can use the selected acoustic model to synthesize the reply text based on the selected basic TTS parameters, speech emotion parameters, speech scene parameters, and background sound effect, finally generating a synthesized speech (jin1, tian1, tian1, qi4, hen3, hao3) for replying to the user.
  • It should be noted that FIG. 4 shows only one specific implementation of the present invention. Other possible implementations may include more or fewer functional modules, and the functional modules described above may be appropriately split, combined, or deployed differently, and so on.
  • the acoustic model library 1033 can be deployed in the TTS engine 1032 to make it easier for the TTS engine to call the acoustic model and perform speech synthesis through the acoustic model.
  • the acoustic model library 1033 may also be deployed in the voice dialogue module 102, or deployed outside the voice dialogue module 102.
  • the PM module 1027 and the TTS parameter database 1026 may also be integrated together and independently deployed at a location outside the voice dialog module 102.
  • the PM module 1027 may also be specifically deployed in the TTS engine 1032, that is, "TTS parameter management" may be implemented as a function of the TTS engine 1032.
  • the intent recognition module 1025 may also be specifically deployed in the DM module 1023, that is, “intent recognition” may be implemented as a function of the DM module 1023.
  • In still other embodiments, the TTS parameter database 1026 may be specifically deployed in the PM module 1027, that is, the PM module 1027 may organize and store the TTS parameters by category and user identity; or the TTS parameter database 1026 may be independently deployed at a location outside the voice dialogue module 102; or the acoustic model library 1033 may be independently deployed at a location outside the TTS module 103; or the acoustic model library 1033 may be deployed together with the TTS parameter database 1026, and so on.
  • the PM module 1027 may be split into a basic TTS parameter management module 1028 and an enhanced TTS parameter management module 1029.
  • the basic TTS parameter management module 1028 is used to manage the basic TTS parameters and customized character pronunciation tables in the TTS parameter database 1026.
  • The management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more basic TTS parameters according to the user's intention to set the basic TTS parameters, and performing operations such as query, add, delete, update (change), select, and obtain (confirm) on the custom character pronunciation table according to the user's intention to set the preferred pronunciation corresponding to a target character.
  • the basic TTS parameter management module 1028 can also be used to obtain the basic TTS parameters associated with the user identity, the user's preferred pronunciation corresponding to the target character, and so on.
  • the enhanced TTS parameter management module 1029 is used to manage the enhanced TTS parameters and music library in the TTS parameter database 1026.
  • The management includes performing operations such as query, add, delete, update (change), select, and obtain (confirm) on one or more enhanced TTS parameters according to the user's intention to set the enhanced TTS parameters, and performing the same operations on the music library according to the user's intention to set the background sound effect.
  • the enhanced TTS parameter management module 1029 can obtain the enhanced TTS parameters and background sound effects used to enhance the speech synthesis effect according to the content and context information of the reply text.
  • each module in the foregoing embodiment in FIG. 9 may be a software module, and these software modules may be stored in the memory of the terminal device, and the processor in the terminal device calls these modules in the memory to Perform a speech synthesis method.
  • the implementation form of each module in the foregoing embodiment in FIG. 9 may be a hardware component in a terminal device.
  • the enhanced TTS parameter management module 1029 may also be deployed in the TTS engine 1032, that is, "enhanced TTS parameter management" may be implemented as a function of the TTS engine 1032.
  • Through the embodiments described above, the voice dialogue module can, on the one hand, generate the corresponding reply text and, on the other hand, select personalized TTS parameters based on the reply text and the dialogue context information, combined with the current user's identity and preferences; the TTS module can then generate a reply speech with a specific style based on these personalized TTS parameters, providing the user with a personalized speech synthesis effect. This greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of the human-machine dialogue.
  • the terminal also allows the user to tune the terminal in real time through voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • the method process includes but is not limited to the following steps:
  • Step 101 The user inputs a voice to the terminal, and accordingly, the terminal obtains the voice input by the user.
  • The terminal in the embodiment of the present invention may be a dialogue interactive robot, a home / commercial robot, a smart speaker, a smart table lamp, a smart home appliance, smart furniture, a smart vehicle, or voice assistant / voice conversation software running on a device such as a mobile phone, notebook computer, or tablet computer.
  • Step 102 The terminal recognizes the content of the voice input by the user, and recognizes the voice as text.
  • Specifically, the terminal can recognize the content of the user's input voice through the ASR module of its voice response system; for example, the content of the user's input voice may be recognized as "You speak too slowly, please speak faster", "Can the volume be turned up", "What is the line before 'someone in the depths of the white clouds'", and so on.
  • the ASR module can be directly implemented by using a current commercial ASR system. Those skilled in the art are familiar with the implementation manner, and will not be described here.
  • Step 103 The terminal determines the identity of the user.
  • the terminal may recognize the identity of the user through the UM module of its voice response system.
  • Specifically, the UM module may determine the person who input the voice (i.e., the user) through voiceprint recognition, face recognition, or even multi-modal biometric recognition. Understandably, if the terminal recognizes the user as a locally registered user (for example, the current user is xiaoming), the TTS parameters corresponding to that user can subsequently be adjusted; if the terminal cannot identify the user, the user is determined to be a stranger (for example, the current user is xiaohua), and the default TTS parameters can subsequently be adjusted.
  • Step 104 The terminal determines the user's speaking intention.
  • the terminal may determine the user's intention to speak in combination with the NLU module and the intent recognition module of its voice response system.
  • the implementation process includes the following:
  • The NLU module performs text analysis on the recognized text, including word segmentation, semantic analysis, and part-of-speech analysis, to extract keywords/words related to TTS parameter setting.
  • For example, keywords/words related to TTS parameter setting may include "sound", "volume", "speaking speed", "pronunciation", "emotion", "recitation", "fast", "slow", "happy", "sad", and so on.
  • the intent recognition module combines the context of the dialogue to perform reference resolution and sentence completion, and then can use template matching or statistical model to identify whether the user has the intention to update TTS parameters.
  • Reference resolution refers to identifying which noun phrase a pronoun in the text refers to.
  • The template matching method first analyzes the combinations of keywords and words appearing in common instructions, and then constructs templates/rules to match specific intents. For example, if a text sentence matches "... sound / speak / talk / read ... slow(er) / fast(er) ...", the user's speaking intention can be taken to be adjusting the sound speed in the basic TTS parameters corresponding to the user (for example, increasing or decreasing the sound speed by 20%); if a text sentence matches "... sound / speak / talk / read ... loud / quiet / big / small ...", the user's speaking intention can be taken to be adjusting the volume in the basic TTS parameters corresponding to the user (for example, increasing or decreasing the volume by 20%); if a sentence matches a template such as "[word 1] should be pronounced / read ...", the user's speaking intention can be taken to be correcting the pronunciation of a target character.
  • Alternatively, a statistical model can be trained to classify intents. Training algorithms include, but are not limited to, Support Vector Machine (SVM) algorithms, Naive Bayes algorithms, decision tree algorithms, neural network (NN) algorithms, and so on. In this way, after the model is trained, when the user's speaking intent needs to be determined, the keywords/words corresponding to the user's spoken text sentence are input to the model to determine the speaking intent corresponding to the text sentence.
  • In a possible embodiment, trained models can be classified in advance by dialogue domain or topic type, for example into "weather", "poetry", "song", "news", "daily life", "movie", "sports", and so on.
  • The intent recognition module can determine the dialogue domain or topic type based on the current conversation state and the keywords/words of the text sentence, and then feed the keywords/words into the corresponding dialogue-domain model or topic-type model to determine the speaking intent of the text sentence.
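  • As a minimal sketch of the template/keyword matching idea described above (the actual templates, keywords, and adjustment amplitudes used by the intent recognition module 1025 are not specified here, so all patterns and intent names below are assumptions):
```python
import re

# Hypothetical templates mapping keyword patterns to TTS-parameter-setting intents.
INTENT_TEMPLATES = [
    (re.compile(r"too slow|speak faster|read faster"), "increase_sound_speed"),
    (re.compile(r"too fast|speak slower|read slower"), "decrease_sound_speed"),
    (re.compile(r"louder|volume up|too quiet|can't hear"), "increase_volume"),
    (re.compile(r"quieter|volume down|too loud"), "decrease_volume"),
]

def match_tts_setting_intent(utterance: str):
    """Return the first matching TTS-setting intent, or None for an ordinary dialogue turn."""
    text = utterance.lower()
    for pattern, intent in INTENT_TEMPLATES:
        if pattern.search(text):
            return intent
    return None

print(match_tts_setting_intent("You speak too slowly, please speak faster"))  # increase_sound_speed
```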
  • Step 105 The terminal determines whether the user's speaking intention is to set the TTS parameters.
  • Step 106 If it is determined that the speaking intention is to set the TTS parameters (such as update, delete, and add operations), the terminal executes the setting operation of the TTS parameters according to the instruction of the speaking intention.
  • The TTS parameters include basic TTS parameters associated with the identity of the user, such as changes to the sound speed, volume, and pitch, as well as the custom character pronunciation table; the TTS parameters also include enhanced TTS parameters such as speech emotion parameters and speech scene parameters, as well as background sound effects. It should be understood that, in a possible implementation, the enhanced TTS parameters may or may not be associated with the identity of the user.
  • the setting operations are operations such as adding TTS parameters, deleting TTS parameters, and updating (changing) TTS parameters.
  • an update operation may be performed on a TTS parameter associated with the user identity. If the user is an unregistered user, a local user identity may be created / registered for the user. The local user identity is initially associated with the default TTS parameters, and then the default TTS parameters associated with the user identity are updated.
  • the terminal may use the PM module of the voice response system to update the TTS parameter associated with the user identity in the TTS parameter database according to the TTS parameter update instruction issued by the voice dialogue module (such as the NLU module and / or the intent recognition module). Perform the update operation.
  • the basic TTS parameter represents the amount of change (or change coefficient) relative to the physical elements of the basic speech.
  • The amounts of change in the preset sound speed, preset volume, and preset pitch can be organized and stored according to user identity; see FIG. 11, which shows an exemplary chart of basic TTS parameters associated with user identities. As shown in FIG. 11, each array in the chart represents the rising/falling ratios applied to the preset sound speed, preset volume, and preset pitch of the selected acoustic model.
  • the chart includes unregistered users and registered users.
  • An unregistered user is a user who has not yet performed identity registration or has failed authentication; the associated changes to the preset sound speed, preset volume, and preset pitch are all the default value of 0. Registered users are users who have performed identity registration and passed authentication, for example "xiaoming", "xiaoming_mom", "xiaoming_grandma", "xiaoming_dad", and so on.
  • For example, for a registered user whose associated basic TTS parameters for sound speed, volume, and pitch are "-40%, +40%, +20%", when replying to that user, the basic speech corresponding to the reply text will have its sound speed reduced by 40%, its volume increased by 40%, and its pitch raised by 20%.
  • the registered users ’preset sound speed, preset volume, and preset pitch changes can be added, corrected / changed, and deleted.
  • For example, based on "xiaoming"'s speaking intention to "increase the volume", the terminal increases the change in the preset volume associated with "xiaoming" from the default value "0" to "+20%"; based on "xiaoming_mom"'s speaking intention to "reduce the sound speed", the terminal reduces the change in the preset sound speed associated with "xiaoming_mom" from the original "+40%" to "+20%", and so on.
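  • The chart of FIG. 11 can be pictured, purely for illustration, as the following per-user table of change coefficients (the values repeat the examples quoted above; the field names are assumptions):
```python
# Per-user basic TTS parameters: rising/falling ratios applied to the acoustic model's presets.
basic_tts_params = {
    "unregistered":     {"sound_speed": 0.00,  "volume": 0.00, "pitch": 0.00},
    "xiaoming":         {"sound_speed": 0.00,  "volume": 0.20, "pitch": 0.00},  # after "increase the volume"
    "xiaoming_grandma": {"sound_speed": -0.40, "volume": 0.40, "pitch": 0.20},
    "xiaoming_mom":     {"sound_speed": 0.20,  "volume": 0.00, "pitch": 0.00},  # lowered from the original +40%
}

def apply_basic_params(preset_speed: float, preset_volume: float, preset_pitch: float, user_id: str):
    """Apply a user's change coefficients to the acoustic model's preset values."""
    p = basic_tts_params.get(user_id, basic_tts_params["unregistered"])
    return (preset_speed * (1 + p["sound_speed"]),
            preset_volume * (1 + p["volume"]),
            preset_pitch * (1 + p["pitch"]))

# Speed down 40%, volume up 40%, pitch up 20% relative to the presets.
print(apply_basic_params(1.0, 1.0, 1.0, "xiaoming_grandma"))
```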
  • the customized character pronunciation table may be organized and stored according to user identity.
  • FIG. 12 shows an exemplary diagram of a customized character pronunciation table associated with a user identity.
  • the custom character pronunciation table corresponding to an unregistered user is empty, and the custom character pronunciation table corresponding to a registered user can be added, changed, or deleted based on the user's preference.
  • The object of the setting operation may be a character, a person/place name, a letter, or a special symbol that the terminal easily mispronounces or for which the user has a preference.
  • the customized character pronunciation table includes the mapping relationship between the target character (string) and the user's preferred pronunciation.
  • The target character can be a word (a Chinese character or a foreign-language word), a phrase, a sentence, a number, or a symbol (such as Chinese characters, foreign characters, emoji, punctuation marks, special symbols, and so on).
  • For example, in the terminal's original pronunciation table, "Piggy Page" is pronounced "xiao3, zhu1, pei4, qi2". If "xiaoming"'s speaking intention is to change this, the pronunciation of the character "qi" in the phrase "Piggy Page" can be set to "ki1".
  • Then the terminal writes "Piggy Page" and "xiao3 zhu1 pei4 ki1" as a mapping pair into the custom character pronunciation table associated with "xiaoming". It can be understood that the chart shown in FIG. 12 is merely an example and not a limitation.
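  • A minimal sketch of how such a per-user custom character pronunciation table might be applied when preparing a reply text for synthesis (the data layout, and the Chinese string "小猪佩奇" assumed to underlie "Piggy Page", are illustrative only):
```python
# Per-user custom character pronunciation tables in the style of FIG. 12.
pronunciation_tables = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},   # user-preferred pronunciation
}

def apply_preferred_pronunciations(reply_text: str, default_pinyin: dict, user_id: str) -> dict:
    """Override the default pronunciation of any target string the user has customized."""
    pinyin = dict(default_pinyin)                       # target string -> pronunciation
    for target, preferred in pronunciation_tables.get(user_id, {}).items():
        if target in reply_text:
            pinyin[target] = preferred
    return pinyin

print(apply_preferred_pronunciations(
    "我们一起看小猪佩奇", {"小猪佩奇": "xiao3 zhu1 pei4 qi2"}, "xiaoming"))
```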
  • the speech emotion parameter represents the change of intonation in the voice.
  • Intonation change refers to the rise and fall of pitch within the voice, as well as changes in the emphasis of volume, the sound speed, and the pause/dwell time of the speech; these changes have a very important effect on the expressiveness of the voice. Through intonation, the voice can present complex emotions such as joy, delight, sadness, grief, hesitation, relaxation, firmness, and heroism.
  • the TTS parameter database maintains a mapping relationship between “speech emotion suggested by the voice dialog module” and “speech emotion parameter”.
  • the mapping relationship is, for example, the emotion parameter correction mapping table shown in FIG. 13.
  • The speech synthesized based on different speech emotion parameters will carry the corresponding emotional tone.
  • For example, if the speech emotion suggested by the voice dialogue module is "Neutral", the speech synthesis module synthesizes speech based on the neutral-emotion speech emotion parameters, and the resulting voice reflects a neutral tone (that is, without any particular emotional characteristics); if the suggested speech emotion is "Happy_low", the speech synthesis module synthesizes speech based on the mildly-happy speech emotion parameters, and the resulting voice carries a mildly happy tone; if the suggested speech emotion is "Sad_low", the voice synthesized based on the mildly-sad speech emotion parameters carries a mildly sad tone, and so on.
  • the chart shown in FIG. 13 is only an example and not a limitation.
  • the speech emotion parameters are also related to the reply text and context information.
  • the default voice emotion parameters associated with the user identity can correspond to neutral emotions.
  • The terminal can comprehensively determine the speech emotion parameters used in the current speech synthesis based on the user identity, the reply text, and the context information.
  • For example, if the terminal determines that the reply text and context information do not specify a speech emotion, or the specified speech emotion is consistent with the user's default speech emotion, the terminal applies the user's default speech emotion to the final speech synthesis: if the user's default speech emotion is "neutral" and the terminal determines that the current reply text specifies no speech emotion, the terminal still applies "neutral" to the final synthesis. If the terminal determines that the reply text and context information require a specified speech emotion that is inconsistent with the user's default speech emotion, the terminal automatically adjusts the current speech emotion to the one it has determined: for example, the user's default speech emotion is "neutral", but the terminal determines that the speech synthesis of the current reply text needs a "mildly happy" emotion, so the terminal adopts the "mildly happy" speech emotion parameters for the final synthesis.
  • the terminal may update the voice emotion parameters associated with the identity of the user based on the user's speaking intention. As shown in FIG. 14, the terminal may change the voice emotion parameters associated with “xiaoming_grandma” according to the speaking intent of “xiaoming_grandma”, that is, change the voice emotion parameters of the “neutral emotion” to the voice emotion parameters of “lightly happy”. It can be understood that the chart shown in FIG. 14 is only an example and not a limitation.
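  • For illustration, the emotion parameter correction mapping of FIG. 13 and the default-versus-suggested selection logic described above might be sketched as follows (the numeric corrections are not given in the text and are pure placeholders):
```python
from typing import Optional

# Hypothetical emotion parameter correction mapping (labels follow the examples above).
EMOTION_PARAMS = {
    "Neutral":   {"pitch": 0.00,  "sound_speed": 0.00,  "volume": 0.00},
    "Happy_low": {"pitch": 0.05,  "sound_speed": 0.05,  "volume": 0.05},
    "Happy_mid": {"pitch": 0.10,  "sound_speed": 0.10,  "volume": 0.10},
    "Sad_low":   {"pitch": -0.05, "sound_speed": -0.10, "volume": -0.05},
}

def select_emotion(user_default: str, suggested: Optional[str]):
    """Use the emotion suggested for the reply text if any, otherwise the user's default."""
    label = suggested or user_default
    return label, EMOTION_PARAMS.get(label, EMOTION_PARAMS["Neutral"])

print(select_emotion("Neutral", "Happy_low"))   # terminal overrides the default with "Happy_low"
print(select_emotion("Neutral", None))          # no suggestion: keep the user's default emotion
```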
  • the speech scene parameters in the enhanced TTS parameters represent the change of the rhythm in the speech.
  • A so-called prosodic rhythm change gives the speech a clearer and stronger rhythm and emotional expression than the natural rhythm of ordinary dialogue, so that the voice dialogue fits a specific application scenario. The change in prosodic rhythm can be reflected in changes of pause position/pause duration, stress position, character/word sound length, character/word sound speed, and so on. Specific combinations of these prosodic changes can present voice scene effects such as "poetry recitation", "song humming (or nursery rhymes)", "storytelling", and "news broadcasting".
  • the TTS parameter database maintains a mapping relationship between “speech scenarios suggested by the voice dialogue module” and “speech scenario parameters”, and the mapping relationship is, for example, the scenario parameter modification mapping table shown in FIG. 15.
  • The speech synthesized based on different speech scene parameters will reflect the corresponding scene tone: speech synthesized based on the daily-conversation scene parameters reflects the tone of everyday conversation, speech synthesized based on the poetry-recitation scene parameters reflects the tone of poetry recitation, speech synthesized based on the song-humming scene parameters reflects the tone of song humming, and so on.
  • the chart shown in FIG. 15 is merely an example and not a limitation. In a possible embodiment, other voice scene parameters may also be designed based on the needs of actual applications, such as story interpretation, news broadcast, and the like.
  • the voice scene parameters are mainly related to the reply text and context information.
  • the voice scene corresponding to the default voice scene parameter associated with the user identity is “daily conversation”.
  • The terminal may comprehensively determine the speech scene parameters used in the current speech synthesis based on the user identity, the reply text, and the context information. For example, if the terminal determines that the reply text and context information do not specify a voice scene, or the specified voice scene is consistent with the user's default voice scene, the terminal applies the user's default voice scene parameters to the final speech synthesis; if the current reply text specifies no voice scene, the terminal still applies "daily conversation". If the terminal determines that the reply text and context information require a voice scene that is inconsistent with the user's default voice scene, the terminal automatically adjusts the current voice scene to the one it has determined: for example, the user's default voice scene is "daily conversation", but the terminal determines that the speech synthesis of the current reply text requires "poetry recitation", so the terminal applies the speech scene parameters corresponding to "poetry recitation" to the final synthesis.
  • the terminal may update the default voice scene parameters associated with the identity of the user based on the user's speaking intention. As shown in FIG. 16, the terminal may change the voice scene corresponding to the default voice scene parameter of “xiaoming_dad” from “daily conversation” to “poem recitation” according to the speaking intention of “xiaoming_dad”. It can be understood that the chart shown in FIG. 16 is merely an example and not a limitation.
  • the PM module performs a specific update operation.
  • The process can be implemented as follows: the PM module maintains a mapping table between parameter-update intents and specific operation interfaces, so that the corresponding operation API can be determined from the currently identified intent ID.
  • For example, for the intention of increasing the volume, the PM module calls the Update-Costomized-TTS-Parameters-volume interface, whose inputs are the user ID and the adjustment amplitude; for the intention of correcting the pronunciation of a character or symbol, it calls the Update-Costomized-TTS-Parameters-pron interface, whose inputs are the user ID, the symbol to be corrected, the target pronunciation string, and so on.
  • The PM module executes the relevant update interface and implements the TTS parameter update process described above. If the current user is an unregistered user, the PM module can add a user information record for the unknown user, with its associated TTS parameters set to the default values, and then update the associated TTS parameters.
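  • The "intent ID to operation API" mapping table might look roughly like the following sketch (the Python function names only mirror the Update-Costomized-TTS-Parameters-volume and Update-Costomized-TTS-Parameters-pron interfaces named above; the amplitudes and argument shapes are assumptions):
```python
def update_costomized_tts_parameters_volume(user_id: str, delta: float) -> None:
    # Placeholder for the volume-update interface: adjust the user's volume change coefficient.
    print(f"volume change for {user_id} adjusted by {delta:+.0%}")

def update_costomized_tts_parameters_pron(user_id: str, symbol: str, pinyin: str) -> None:
    # Placeholder for the pronunciation-correction interface: update the custom pronunciation table.
    print(f"pronunciation of '{symbol}' for {user_id} set to '{pinyin}'")

# Mapping table from recognized intent IDs to operation interfaces.
INTENT_TO_API = {
    "increase_volume": lambda user, args: update_costomized_tts_parameters_volume(user, +0.20),
    "decrease_volume": lambda user, args: update_costomized_tts_parameters_volume(user, -0.20),
    "correct_pronunciation": lambda user, args: update_costomized_tts_parameters_pron(user, *args),
}

def execute_tts_setting(intent_id: str, user_id: str, args: tuple = ()) -> None:
    handler = INTENT_TO_API.get(intent_id)
    if handler is None:
        raise ValueError(f"no update interface registered for intent '{intent_id}'")
    handler(user_id, args)

execute_tts_setting("increase_volume", "xiaoming")
execute_tts_setting("correct_pronunciation", "xiaoming", ("小猪佩奇", "xiao3 zhu1 pei4 ki1"))
```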
  • Step 107 The terminal generates a reply text in combination with the context information.
  • In a possible embodiment, if the user's speaking intention is to set TTS parameters, the terminal generates a reply text after setting the TTS parameters based on that intention, and the reply text is mainly used to inform the user that the terminal has completed the TTS parameter setting.
  • For example, if the user's intention indicated by the current input voice is "increase the sound speed" or "increase the volume", a preset text corresponding to the setting result may be returned as the reply text, such as "I will speak faster" or "The volume has been turned up a bit", and so on.
  • In a possible embodiment, the terminal may combine the content of the user's speech and the context information of the conversation to generate a reply text for replying to the user. For example, if the content of the user's input voice is "What is the weather today?", the terminal may query local or network resources, or use a conversation model, to obtain a reply text for the user; if the content of the user's input voice is "What is the line before 'someone in the depths of the white clouds'?", the terminal may query local or network resources, or use a conversation model, to obtain the reply text "The line before 'someone in the depths of the white clouds' is 'Far on the Hanshan stone trail'", and so on.
  • the terminal may generate a reply text through the NLG module of the voice response system and the context information in the DM module.
  • the reply text generation can be implemented through retrieval-based, model-based generation, and the like.
  • For retrieval-based generation, a specific method can be as follows: prepare a corpus of question-answer pairs in advance, find the question in the corpus that best matches the current question when generating the reply, and then return the corresponding answer as the reply text.
  • For model-based generation, a specific method may be: train a neural network model in advance on a large number of question-answer pairs; in the process of generating the reply text, use the question as the input to the neural network model, compute the corresponding answer, and use that answer as the reply text.
  • Step 108 The terminal determines the TTS parameters required for the current reply text.
  • Specifically, the terminal can determine, through the PM module (or the basic TTS parameter management module) of its voice response system, the basic TTS parameters associated with the current user identity, such as the changes to the preset pitch, preset sound speed, and preset volume, as well as the user's preferred pronunciation of any target characters (strings) in the reply text; the terminal can further determine the corresponding enhanced TTS parameters based on the content of the reply text and the context information, such as speech emotion parameters, speech scene parameters, and background sound effects.
  • the content of the reply text suitable for superimposing the background sound effect may be a poem, a film or television line, or a text with emotional polarity. It should be noted that related content about background sound effects will be described in detail later, and will not be repeated here.
  • Step 109 The terminal selects an acoustic model from a preset acoustic model library according to the current input voice. This step may also be performed before step 108.
  • the terminal is preset with an acoustic model library
  • the acoustic model library may include multiple acoustic models, such as a general acoustic model and several personalized acoustic models, and so on.
  • These acoustic models are all neural network models, and these neural network models can be trained in advance from different corpora.
  • each acoustic model has its own preset information, and these preset information can be used as the basic input information of the acoustic model.
  • The preset information of the general acoustic model may include two or more of the model's preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model likewise includes two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, and may additionally include other personalized information such as catchphrases, ways of responding in specific scenarios, type of wisdom, personality type, mixed-in popular expressions or dialects, forms of address for specific persons, and other language-style characteristics.
  • the acoustic model can convert the reply text into a reply voice according to the preset information and the change information of the preset information.
  • the change information of the preset information referred to here means information such as a basic TTS parameter, an enhanced TTS parameter, a mapping relationship between a target character and a user's preferred pronunciation, and a background sound effect selected in speech synthesis.
  • The speech synthesized through the general acoustic model presents the sound effect of a normal, generic dialogue scenario, while the speech synthesized through a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario; the implementation of the "character imitation" dialogue scenario will be described in detail later.
  • In a possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining the acoustic model preferred by the user according to the identity of the user, and selecting that preferred acoustic model from the plurality of acoustic models.
  • In another possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining, according to the content of the current input voice, an acoustic model identification related to that content; the identification of an acoustic model is used to uniquely characterize the acoustic characteristics of that model.
  • For example, the identification of one acoustic model is "Lin Zhiling", indicating that the acoustic model is used to synthesize a "Lin Zhiling"-style voice; the identification of another acoustic model is "Little Shenyang", indicating that the acoustic model is used to synthesize a "Little Shenyang"-style voice, and so on. If the content of the input speech is related to "Lin Zhiling", the acoustic model with the "Lin Zhiling" identification can be selected.
  • In yet another possible embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input voice includes: the terminal determining a weight value for each of the plurality of acoustic models according to the identity of the user, where the weight value of each acoustic model is preset by the user or is determined in advance by learning the user's preferences; the acoustic models are then weighted and superimposed based on these weight values to obtain a comprehensive acoustic model (which may be referred to as a fusion model), and the fusion model is selected.
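  • The weighted superposition into a fusion model can be illustrated with a deliberately simplified sketch: real acoustic models are neural networks, so here each model is reduced to a small parameter vector purely to show the weighting step (model names and weights are assumptions):
```python
from typing import Dict, List

def fuse_acoustic_models(model_params: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weight-and-sum the parameter vectors of several acoustic models into one fusion model."""
    total = sum(weights.values())
    dim = len(next(iter(model_params.values())))
    fused = [0.0] * dim
    for name, w in weights.items():
        for i, value in enumerate(model_params[name]):
            fused[i] += (w / total) * value
    return fused

model_params = {
    "general":         [1.0, 1.0, 1.0],
    "lin_zhiling":     [1.2, 0.9, 1.1],   # identifiers stand for different voice styles
    "little_shenyang": [0.8, 1.3, 1.0],
}
weights = {"general": 0.5, "lin_zhiling": 0.3, "little_shenyang": 0.2}  # e.g. learned preferences
print(fuse_acoustic_models(model_params, weights))
```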
  • Step 110 The terminal generates a corresponding calling instruction according to the reply text and the determined TTS parameters.
  • the terminal may generate a call instruction required by the TTS engine according to a reply text, a determined TTS parameter, and the like through an instruction generation module of the voice response system.
  • For example, if the user's intention is to "increase the volume", the corresponding reply text is "The volume has been turned up a bit"; for the TTS parameters determined by the terminal and the calling instruction generated based on the reply text and those parameters, reference may be made to the example chart shown in FIG. 18, which is not repeated here.
  • Step 111: The terminal performs a speech synthesis operation based on the calling instruction. Specifically, the terminal uses the acoustic model to perform speech synthesis on the reply text according to the preset information of the acoustic model, the basic speech synthesis information, and the enhanced speech synthesis information, to obtain the reply voice.
  • Specifically, the terminal may use the TTS engine of its voice response system to call the acoustic model determined in step 109 to perform the speech synthesis operation, synthesizing speech based on the preset information of the acoustic model and the relevant TTS parameters to obtain the reply voice.
  • the TTS engine may be a system constructed based on a statistical parameter synthesis method, which can fully consider various TTS parameters to synthesize different styles of speech.
  • Step 112. The terminal returns a reply voice to the user.
  • the terminal may play the reply voice to a user through a speaker.
  • the terminal may further display a reply text corresponding to the reply voice through a display screen.
  • It can be seen that the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, and improving the timeliness of the human-machine dialogue.
  • the terminal also allows the user to tune the terminal's voice response system in real time by voice, and update the TTS parameters associated with the user's identity and preferences, making the tuned terminal closer to the user's interaction preferences and maximizing the user's interactive experience.
  • the process includes but is not limited to the following steps:
  • Step S201 This step is a specific refinement of step S104 in the embodiment of FIG. 10 described above.
  • the terminal recognizes that the user's speaking intention is to correct the pronunciation of the target character, such as correcting the polyphony of one or more polyphonic characters.
  • the user ’s speech content is “wrong, should be read as xiao3 qian4, not xiao3 xi1”.
  • When the terminal analyzes the recognized text through the NLU module, it extracts the keywords "wrong" and "should be read as". The intent recognition module uses these keywords to match the preset sentence template "... read / called / spoken wrong ... should be read / called / spoken as ... not ...", and determines that the current user's speaking intention is to "correct the pronunciation of the target character" (that is, a TTS parameter needs to be updated).
  • Step S202 corresponds to step S105 in the embodiment of FIG. 10 described above, that is, the terminal determines whether the user's speaking intention is to update a TTS parameter.
  • Steps S203 to S205 These steps correspond to step S106 in the embodiment of FIG. 10, that is, the terminal performs an update operation of the TTS parameter indicated by the speaking intention. Steps S203-S205 are described in detail as follows:
  • Step S203 The terminal extracts misreading and target pronunciation.
  • Specifically, the terminal's intent recognition module may mark "xiao3 xi1" as the misread pronunciation and "xiao3 qian4" as the target pronunciation based on the matched preset sentence template.
  • Step S204 The terminal determines a target word (that is, a target character to be corrected) according to the misreading pronunciation and context information.
  • the terminal's DM module can find the dialogue text output by the terminal in the last round or previous rounds of conversations in the context information, and determine the pronunciation of each word in the dialogue text (such as using an acoustic model to determine the pronunciation ). For example, the output text of the terminal in the last round of conversation was "I'm glad to meet you, Xiao Qian", and the terminal determined that its corresponding pronunciation is "hen3, gao1, xing4, ren4, shi2, ni3, xiao3xi1".
  • The DM module matches the misread pronunciation against the pronunciation string of the output text, and can determine that the word corresponding to the misread pronunciation "xiao3 xi1" is "Xiao Qian"; that is, "Xiao Qian" is the target word (the character to be corrected).
  • Step S205 The terminal adds the target word and the target pronunciation to a customized character pronunciation list associated with the identity of the user.
  • Specifically, the terminal adds the target word "Xiao Qian" and the target pronunciation "xiao3 qian4" as a new target-character-to-pronunciation pair to the custom character pronunciation table associated with the current user identity through the PM module. Understandably, in future man-machine conversations, when the terminal's reply text contains "Xiao Qian", the PM module will determine, according to the records of the custom character pronunciation table, that the pronunciation of "Xiao Qian" is "xiao3 qian4".
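  • Steps S203-S205 can be sketched as the following simplified matching routine (the word/pinyin alignment of the previous reply is idealized; all names are illustrative):
```python
def correct_pronunciation(prev_words, prev_pinyin, misread, target, pron_table):
    """Find the word in the terminal's previous reply whose pronunciation matches the
    misread pinyin, and record the user-preferred (target) pronunciation for it."""
    normalize = lambda s: s.replace(" ", "")
    for word, pinyin in zip(prev_words, prev_pinyin):
        if normalize(pinyin) == normalize(misread):
            pron_table[word] = target
            return word
    return None

table = {}
target_word = correct_pronunciation(
    prev_words=["很", "高兴", "认识", "你", "小倩"],                          # "I'm glad to meet you, Xiao Qian"
    prev_pinyin=["hen3", "gao1 xing4", "ren4 shi2", "ni3", "xiao3 xi1"],   # terminal's (misread) pinyin
    misread="xiao3 xi1",
    target="xiao3 qian4",
    pron_table=table,
)
print(target_word, table)   # 小倩 {'小倩': 'xiao3 qian4'}
```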
  • It can be seen that, in a voice conversation, the terminal allows the user to tune its voice response system through voice in real time, correcting the pronunciation of the target character specified by the user (such as a polyphonic character) based on the user's intention, thereby updating the TTS parameters associated with the user's identity and preferences; this makes the tuned terminal closer to the user's interaction preferences and maximizes the user's interactive experience.
  • step S108 in the foregoing embodiment of FIG. 10 is described in detail below. Referring to FIG. 21, the process may include the following steps:
  • Step 301 This step is a refinement of step S103 in the embodiment of FIG. 10 described above.
  • the terminal determines whether the user identity of the current user is registered (or whether the identity verification is passed).
  • Step 302. If the terminal determines that the user identity of the current user is registered, read the basic TTS parameters associated with the user.
  • For example, if the current user is "xiaoming_grandma", the basic TTS parameters associated with "xiaoming_grandma" can be found in the TTS parameter database: the preset sound-speed change coefficient is -40%, the preset volume change coefficient is +40%, and the preset pitch change coefficient is +20%.
  • Step 303 If the terminal determines that the user identity of the current user has not been registered (or has not passed identity authentication), it obtains default basic TTS parameters.
  • For example, if the current user is xiaohua, since the identity "xiaohua" has not been registered and does not exist in the TTS parameter database, the corresponding default values for unregistered users (as shown in FIG. 11, the preset sound speed, preset volume, and preset pitch change coefficients are all 0) can be returned as the basic TTS parameters of the current user.
  • Step 304: The terminal compares the reply text with the custom character pronunciation table associated with the current user, determines whether there are any characters/words/symbols in the text that match entries of the custom character pronunciation table, and if so, obtains the target pronunciation of those characters/words/symbols.
  • Step 305 The terminal obtains the speech emotion parameters in the corresponding enhanced TTS parameters from the TTS parameter database according to the reply text.
  • Specifically, the DM module may be preset with an emotion recommendation model trained on a large number of dialogue texts with emotion labels. The DM module inputs the reply text into the emotion recommendation model and can determine the emotion category (such as happiness or sadness) and the degree of emotion (such as mild happiness or moderate happiness) of the current reply text. The PM module then determines the speech emotion parameters from the emotion parameter correction mapping table of the TTS parameter database according to the DM module's emotion recommendation. For example, if the current reply text is "That's great" and the emotion recommended by the emotion recommendation model for the reply text is "moderately happy", the PM module obtains the speech emotion parameters corresponding to "moderately happy" in the emotion parameter correction mapping table shown in FIG. 13.
  • Step 306 The terminal obtains the voice scene parameters in the corresponding enhanced TTS parameters from the TTS parameter database according to the reply text and the context information.
  • the DM module may determine the scene of the current conversation according to the context information of the current conversation and the reply text. Furthermore, the PM module can obtain the voice scene parameters in the corresponding enhanced voice parameters according to the determined dialogue scene.
  • For example, if the current reply text is a specific line of a seven-character poem (for example, "Mengbo Dongwu Wanli Ship"), and the DM module determines from the dialogue context information and the reply text that the current dialogue scene is an ancient poem Solitaire scene, the voice scene is positioned as "poetry recitation", and the PM module obtains the speech scene parameters corresponding to "poetry recitation" in the scene parameter correction mapping table shown in FIG. 15.
  • Similarly, if the voice scene is positioned as "song humming", the PM module obtains the speech scene parameters corresponding to "song humming" in the scene parameter correction mapping table shown in FIG. 15.
  • If the DM module determines from the preceding dialogue context information and the reply text that the current scene is a character imitation scene, the voice scene is positioned as "character imitation", and the PM module obtains the speech scene parameters corresponding to "character imitation" in the scene parameter correction mapping table shown in FIG. 15, and so on.
  • It can be seen that the terminal can select different TTS parameters for different users (such as basic TTS parameters, the user's preferred pronunciation of target characters, speech emotion parameters, and speech scene parameters) based on the interactive reply text and the dialogue context information, so as to automatically combine user preferences and dialogue scenarios to generate reply voices of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, improving the timeliness of the human-machine dialogue, and improving the user's interactive experience.
  • the speech synthesis method of the embodiment of the present invention is described below by taking a speech scene of "poem recitation" as an example. Referring to FIG. 22, the method can be described by the following steps:
  • Step 401 The terminal presets a voice scene parameter of "poem recitation".
  • the TTS parameter database of the terminal is preset with a voice scene parameter of "poem recitation".
  • The "poetry recitation" speech scene focuses on the prosodic rhythm of the speech; the "poetry recitation" speech scene parameters are used to adjust, for input text that conforms to a specific syntactic format, the pause positions/pause durations (that is, the segmentation of the text content), the reading length of characters or words, and the stress positions, so as to strengthen the rhythm. Compared with the natural rhythm of ordinary dialogue, the strengthened prosodic rhythm is clearer and has stronger emotional expression; for example, when reading poems, nursery rhymes, and other texts with a specific syntactic format, the strengthened prosodic rhythm can produce a cadenced, rising-and-falling feeling.
  • The voice scene parameters of "poem recitation" can be implemented through prosodic rhythm templates.
  • The text content of each specific literary style (or syntactic format) can correspond to one or more prosodic rhythm templates.
  • Each prosodic rhythm template defines the volume change of the word at each position in the template (that is, the stress of the word), the change of its sound length (that is, the duration of the word's pronunciation), and
  • the pause positions/pause durations of the pronunciation in the text (that is, the word segmentation of the text content).
  • For example, for a five-character line, the word segmentation can be done in two ways: "2 words-3 words" and "2 words-2 words-1 word";
  • correspondingly, the reading durations of the words can be "short-long-short-long" or "short-short-long-long",
  • and the stress of the words can be "light-light-light-heavy" or "light-heavy-light-light-heavy", respectively.
  • Another approach is training and learning based on a dedicated prosodic corpus read aloud by voice models, using statistical, machine learning, or deep network frameworks to obtain a model covering pause positions, character or word reading durations, and stress positions.
  • After the model is trained, the text content to which the "poem recitation" mode needs to be applied is input into the model, and the prosodic rhythm template corresponding to that text content is obtained.
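  • For illustration only, such a prosodic rhythm template could be stored as a small lookup structure. The field names (segmentation, duration, stress, pause_after) and the concrete values in the sketch below are hypothetical, not values prescribed by this embodiment:

```python
# Illustrative sketch: a hand-written prosodic rhythm template for a
# five-character line. Field names and values are hypothetical; durations
# and stresses follow the "short"/"long" and "light"/"heavy" notation above.
FIVE_CHAR_LINE_TEMPLATE = {
    "segmentation": [2, 3],                                   # "2 words-3 words"
    "duration":     ["short", "long", "short", "short", "long"],
    "stress":       ["light", "light", "light", "light", "heavy"],
    "pause_after":  [0.02, 0.0, 0.0, 0.0, 0.05],              # pause (s) after each word
}

TEMPLATES = {"five_char_line": FIVE_CHAR_LINE_TEMPLATE}

def lookup_template(literary_style: str) -> dict:
    """Return the prosodic rhythm template associated with a literary style."""
    return TEMPLATES[literary_style]

print(lookup_template("five_char_line")["segmentation"])      # -> [2, 3]
```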
  • Step 402 The terminal determines, based on the reply text and context information, that the voice scene of the current conversation is the "poem recitation" voice scene.
  • the terminal may determine that the voice scene of the current conversation is a voice scene of "poem recitation" through the DM module.
  • the manner in which the DM module determines the current scene as a voice scene of "poem recitation” may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is a "poem recitation".
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "poem recitation" voice scene.
  • For example, if the user inputs a voice instructing the terminal to perform Tang poetry recitation or ancient-poem Solitaire, the terminal automatically sets the current dialogue scene to the "poem recitation" voice scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the reply text involves poems, ci, qu, fu, and similar literary forms,
  • that is, one or more specific literary styles such as five-character quatrains, seven-character quatrains, or regulated verse, or specific ci or qu tune patterns.
  • the DM module can search local pre-stored libraries or search libraries in the web server through text search matching or semantic analysis.
  • The library can contain literary knowledge materials corresponding to a variety of literary styles.
  • The DM module then determines whether the content of the reply text exists in the library, and if so, sets the current dialogue scene to the "poem recitation" voice scene.
  • Alternatively, the DM module can analyze punctuation (pauses), word counts, the number of sentences, the sequence of word counts per sentence, and so on, and match a piece of text or all of the text in the reply text against pre-stored literary style features. If the match succeeds, the matching piece of text (or all of the text) can be used as the text for the "poem recitation" voice scene.
  • the literary style characteristics of five-character quatrains include: 4 sentences, each sentence is 5 words, a total of 20 words.
  • the literary style features of five-character regulated verse include: 8 sentences, each sentence is 5 words, a total of 40 words.
  • the literary style characteristics of the seven-character quatrains include: 4 sentences, each sentence is 7 words, a total of 28 words.
  • the literary style features of the Song ci short tune "Rumengling" include: 7 sentences, with 6, 6, 5, 6, 2, 2, and 6 characters respectively. Suppose a piece of text in the reply text reads: "The mountains are like daisies outside the window, the classroom is boring. The teacher on the stage speaks at a high speed."
  • In that case, the DM module can determine that its literary style features conform to those of "Rumengling", and set the current dialogue scene to the "poem recitation" voice scene.
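  • As an informal sketch of this matching step, the reply text can be cut at punctuation marks and the resulting per-sentence character counts compared against pre-stored style features. The style names, the exact patterns, and the helper function below are assumptions for illustration, not the embodiment's actual matching logic:

```python
import re

# Illustrative literary-style features: characters per sentence for each style.
STYLE_FEATURES = {
    "five_char_quatrain":  [5, 5, 5, 5],           # 4 sentences x 5 characters
    "seven_char_quatrain": [7, 7, 7, 7],           # 4 sentences x 7 characters
    "rumengling":          [6, 6, 5, 6, 2, 2, 6],  # Song ci short tune "Rumengling"
}

def detect_literary_style(text: str):
    """Split the text at punctuation and compare per-sentence lengths to known styles."""
    sentences = [s for s in re.split(r"[，。！？、,.!?\s]+", text) if s]
    counts = [len(s) for s in sentences]
    for style, pattern in STYLE_FEATURES.items():
        if counts == pattern:
            return style        # matched: switch to the "poem recitation" scene
    return None                 # no match: keep the normal dialogue scene
```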
  • Step 403 The terminal determines a voice scene parameter corresponding to the current "poem recitation" voice scene.
  • the terminal determines a voice scene parameter corresponding to the current "poem recitation" voice scene through the PM module.
  • The literary style (or literary style feature) is associated with a prosodic rhythm template,
  • so the PM module can obtain the associated prosodic rhythm template from the TTS parameter database. The prosodic rhythm template contains the corresponding voice scene parameters (that is, prosodic rhythm change information); specifically, the voice scene parameters include information such as the volume changes and duration changes of the words at each position in the template, and the pause positions/pause durations of the speech in the text.
  • In other words, the voice scene parameters corresponding to the prosodic rhythm template include the specific word segmentation method, the reading duration of each character in each sentence, and the stress information for each character.
  • In addition, the selection of the voice scene parameters may also be closely related to the voice emotion parameters; that is, different emotion categories (such as happiness or sadness) and different emotion levels (such as mild happiness or moderate happiness) may both affect the voice scene parameters, namely the specific parameters of the prosodic rhythm template corresponding to the literary style (or literary style features).
  • The advantage of this design is that the voice scene can be brought closer to the current voice emotion, which helps make the final voice output more vivid and natural.
  • For example, the standard parameters of the template include the "2 words-3 words" word segmentation method, together with the corresponding reading duration pattern ("short"/"long") and stress pattern ("light"/"heavy") for each word.
  • Under different voice emotions, however, the final speech presentation of the prosodic rhythm template will differ, and this difference may lie in changes such as word segmentation, intonation, and stress.
  • Table 1 shows a prosodic rhythm template for five-character quatrains and how different voice emotions affect it.
  • The voice emotion 1, voice emotion 2, and voice emotion 3 listed in Table 1 may indicate emotion categories (such as happiness, neutral emotion, sadness) or emotion levels (such as mild happiness, moderate happiness, and extreme happiness). Therefore, for the determined prosodic rhythm template, the PM module can determine the final voice scene parameters from rules similar to those shown in Table 1 according to the voice emotion parameters of the reply text.
  • In a possible embodiment, a support vector machine (SVM) or a deep neural network may also be used to train a model on a large number of prosodic rhythm templates corresponding to different voice emotions, so as to obtain a trained model.
  • The terminal can then input the standard prosodic rhythm template corresponding to the reply text, together with the voice emotion parameters corresponding to the reply text, into the trained model to obtain the final voice scene parameters.
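  • A minimal sketch of that idea, assuming scikit-learn's SVC as the classifier and an invented numeric encoding of emotion categories, emotion levels, and template variants; it only illustrates the shape of the mapping, not an actually trained model:

```python
from sklearn.svm import SVC   # assumed dependency; any classifier would do

# Hypothetical training data: [emotion_category, emotion_level, base_template_id]
X_train = [
    [0, 1, 0],   # mild happiness,   five-character template -> variant 0
    [0, 3, 0],   # extreme happiness                          -> variant 1
    [2, 2, 0],   # moderate sadness                           -> variant 2
]
y_train = [0, 1, 2]   # ids of the adjusted template variants

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X_train, y_train)

def pick_template_variant(emotion_category: int, emotion_level: int, base_id: int) -> int:
    """Predict which prosodic-rhythm-template variant to use for this emotion."""
    return int(clf.predict([[emotion_category, emotion_level, base_id]])[0])
```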
  • Step 404 The terminal aligns the content of the reply text with the prosodic rhythm template to facilitate subsequent speech synthesis.
  • Specifically, the terminal may align the relevant content in the reply text with the prosodic rhythm template of the "poem recitation" voice scene: the terminal combines the pronunciation segments produced by the corresponding acoustic model for the relevant content in the reply text with the parameters of the prosodic rhythm template, and superimposes the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
  • For example, assume the prosody enhancement parameter is δ (0 ≤ δ ≤ 1),
  • and the preset volume of the i-th word in the text content is Vi. If the prosodic rhythm features of the word include a stress feature whose stress change amount is E1, then the final volume of the word is Vi × (1 + E1) × (1 + δ).
  • If the basic sound length of the i-th word in the text is Di and its sound length change amount is E2, then the final sound length of the character is Di × (1 + E2).
  • If a pause is required between the i-th word and the (i + 1)-th word, the pause time is changed from 0s to 0.02s, for example.
  • Assume the reply text includes text content such as "Bai Ri Yi Shan Jin".
  • "Bai Ri Yi Shan Jin" is the first line of a five-character quatrain. If the reply text were synthesized using only the general acoustic model, the synthesized speech (which can be called the basic pronunciation segment) would be "bai2, ri4, yi1, shan1, jin4", with a default basic pronunciation duration of 0.1s for each character and a default interval of 0 between characters.
  • In this embodiment, however, the terminal selects the prosodic rhythm template corresponding to five-character quatrains when choosing the TTS parameters, so that in the subsequent process of synthesizing the reply text through the general acoustic model, the prosodic rhythm template corresponding to five-character quatrains
  • is additionally superimposed on this basic pronunciation segment. As a result, in the final synthesized speech, as shown in FIG. 23, the pronunciation durations of different words in the segment are lengthened to different degrees.
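  • A rough sketch of this superposition for "Bai Ri Yi Shan Jin", using the formulas above; the stress gains E1, duration gains E2, pause values, and δ = 0.5 below are made-up illustrations, not values taken from the embodiment:

```python
# Each character starts from the basic pronunciation segment (volume 1.0,
# duration 0.1 s, interval 0) and the template parameters are superimposed
# as V_i * (1 + E1) * (1 + delta) and D_i * (1 + E2).
delta = 0.5                        # prosody enhancement parameter, 0 <= delta <= 1
base = [("bai2", 1.0, 0.1), ("ri4", 1.0, 0.1), ("yi1", 1.0, 0.1),
        ("shan1", 1.0, 0.1), ("jin4", 1.0, 0.1)]      # (pinyin, volume V, duration D)
template = [                        # (stress gain E1, duration gain E2, pause after)
    (0.0, 0.0, 0.00), (0.2, 0.5, 0.02), (0.0, 0.0, 0.00),
    (0.0, 0.5, 0.00), (0.2, 1.0, 0.02),
]

enhanced = []
for (syllable, v, d), (e1, e2, pause) in zip(base, template):
    enhanced.append({
        "syllable": syllable,
        "volume":   v * (1 + e1) * (1 + delta),   # stressed characters become louder
        "duration": d * (1 + e2),                 # some characters are lengthened
        "pause":    pause,                        # pause inserted after the character
    })
print(enhanced)
```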
  • The following describes the speech synthesis method of the embodiment of the present invention by taking the voice scene of "song humming" (using a nursery rhyme as an example).
  • the method can be described by the following steps:
  • Step 501 The terminal presets the voice scene parameters of "children's song humming".
  • the TTS parameter database of the terminal is preset with voice scene parameters of “Children's Song Humming”.
  • In music, time is divided into equal basic units, and each basic unit is called a "beat".
  • the time value of the beat is expressed by the time value of the note.
  • The time value of a beat can be a quarter note (that is, a quarter note counts as one beat), a half note (a half note counts as one beat), or an eighth note (an eighth note counts as one beat).
  • The meter of music is generally defined in terms of beats, for example 4/4 time: in 4/4 time, a quarter note counts as one beat and there are 4 beats per measure, so each measure can contain 4 quarter notes.
  • Presetting the "children's song humming" voice scene parameters therefore means presetting a variety of nursery rhyme beat types, as well as the text segmentation of the reply text content that needs to be synthesized in the "children's song humming" manner.
  • The beat of the children's song may be determined according to the number of words between two punctuation marks, or according to the number of words in each field after word segmentation.
  • One way is to cut the reply text at punctuation marks: the punctuation marks in the reply text are identified, and suppose the number of words in the fields delimited by the punctuation marks is "3, 3, 7, 8, 3, 8". It can be seen that fields with 3 words appear most often, so the beat that most closely matches the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time, and so on.
  • The other way is to apply word segmentation to the reply text. The segmentation result is, for example, "little / swallow / wears / flowered clothes / every year / spring / comes / here / to / ask / swallow / you / why / come / swallow / says / here / the / spring / most / beautiful". To maintain semantic coherence, the segmentation result can be adjusted so that verbs, adjectives, and adverbs that modify a noun are attached to the modified noun and merged into one field.
  • The previous segmentation result is thus further adjusted to "little swallow / wearing flowered clothes / every year / spring / comes here / to / ask the swallow / why do you / come / the swallow says / here / the spring / most beautiful", and
  • the number of words in each resulting field is "3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2, 3". As can be seen, fields with 3 words appear most often, so the beat that most closely matches the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time, and so on.
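  • A minimal sketch of this beat-selection heuristic, operating directly on the field lengths listed above (the helper name is an assumption for illustration):

```python
from collections import Counter

# Character counts of the fields obtained by the adjusted word segmentation above.
field_lengths = [3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2, 3]

def pick_beat_multiple(lengths):
    """Use the most frequent field length as the beat multiple."""
    most_common_len, _ = Counter(lengths).most_common(1)[0]
    return most_common_len

print(pick_beat_multiple(field_lengths))   # -> 3, so 3/3 or 3/4 time fits best
```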
  • Step 502 The terminal determines, based on the reply text and context information, that the voice scene of the current conversation is the "children's song humming" voice scene.
  • Specifically, the terminal may determine, through the DM module, that the voice scene of the current conversation is the "children's song humming" voice scene.
  • the manner in which the DM module determines the current scene as a voice scene of "Children's Song Humming" may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current conversation is a “child song humming”.
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "children's song humming" scene. For example, if the user inputs a voice instructing the terminal to sing children's songs, the terminal automatically sets the current dialogue scene to the "children's song humming" voice scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the reply text involves the content of children's songs.
  • the DM module can search the local pre-stored nursery rhyme library or search the nursery rhyme library in the web server through text search matching or semantic analysis.
  • The lyric library can contain the lyrics of various nursery rhymes, and the DM module judges whether the content of the reply text exists in these lyrics; if it does, the current dialogue scene is set to the "children's song humming" voice scene.
  • Step 503 The terminal determines a voice scene parameter corresponding to the current "Children's Song Mode".
  • the terminal determines a voice scene parameter corresponding to the current "children's song mode" through a PM module.
  • Specifically, the PM module may determine a text segmentation method according to the content of the reply text (refer to the two methods described above), use this method to segment the reply text to obtain a segmentation result, and then determine the best matching beat according to the segmentation result.
  • Step 504 The terminal aligns the content of the reply text with the determined beat to facilitate subsequent speech synthesis.
  • Specifically, the terminal may align the content of the reply text with the determined beat through the PM module, so as to ensure that each text field follows the beat pattern of the nursery rhyme; that is, the terminal aligns the segmented text fields with the time axis according to the beat pattern.
  • For example, if a field contains 3 words and the measure contains 3 beats, the 3 words can be aligned with the 3 beats of the measure.
  • It may also happen that the number of words in a field of the reply text is less than the number of beats in the measure. For example, if the field has 2 words and the meter is 4/4, the terminal searches the adjacent text fields before and after that field; if the preceding field (or the following field) also has 2 words, this field can be merged with it so that together they align with the 4 beats of the measure. If the adjacent fields cannot be merged, or the number of words after merging is still less than the number of beats, the beats can be further aligned in the following ways.
  • One way is to fill the part of the text that is shorter than the number of beats with silence. Specifically, if the number of words matched to one measure of music is less than the number of beats, each word is matched in time to the position of one beat, and the remaining beats are filled with silence. As shown in (a) of FIG. 25, for the field "Little White Rabbit" in the reply text, if the matching meter is 4/4, then "little", "white", and "rabbit" can be aligned with the first, second, and third beats of the measure, and silence fills the fourth beat. It should be noted that the figure only shows one implementation; in practice, the silence may fall on any one of the first to fourth beats.
  • Another way is to align the rhythm by lengthening the sound length of a word.
  • the purpose of aligning the words and the beats can be achieved by lengthening the reading time of one or more words.
  • For example, if the matching meter is 4/4, "little" and "white" can be aligned with the first and second beats of the measure respectively,
  • and the pronunciation of "rabbit" is stretched so that "rabbit" spans the third and fourth beats. It should be noted that the figure only shows one implementation; in practice, the word whose pronunciation is lengthened may be any word in "Little White Rabbit".
  • Another way is to lengthen the sound length of each word evenly to ensure the overall time alignment.
  • That is, the pronunciation time of each character in the text field is extended evenly so that the characters' pronunciation time aligns with the beats of the music.
  • For example, for the 3-word field "Little White Rabbit" with a matching meter of 4/4, the reading time of each word can be lengthened to 4/3 of a beat, which ensures that the entire field is aligned with the measure.
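  • The three alignment strategies can be sketched as follows; the function names and the beat bookkeeping are illustrative assumptions, not the embodiment's actual implementation:

```python
# Durations are measured in beats; "rest" marks silence.
def align_with_rest(chars, beats_per_measure):
    """Strategy 1: one beat per character, pad the remaining beats with silence."""
    timed = [(c, 1.0) for c in chars]
    timed += [("rest", 1.0)] * (beats_per_measure - len(chars))
    return timed

def align_stretch_last(chars, beats_per_measure):
    """Strategy 2: lengthen one character (here the last) to fill the measure."""
    timed = [(c, 1.0) for c in chars[:-1]]
    timed.append((chars[-1], beats_per_measure - (len(chars) - 1)))
    return timed

def align_stretch_evenly(chars, beats_per_measure):
    """Strategy 3: stretch every character evenly (3 chars over 4 beats -> 4/3 each)."""
    per_char = beats_per_measure / len(chars)
    return [(c, per_char) for c in chars]

field = ["little", "white", "rabbit"]   # "Little White Rabbit", 3 characters
print(align_with_rest(field, 4))        # one beat each plus one beat of silence
print(align_stretch_last(field, 4))     # "rabbit" spans beats 3 and 4
print(align_stretch_evenly(field, 4))   # each character lasts 4/3 beats
```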
  • The following uses the acoustic models for implementing "character imitation" as an example to describe the speech synthesis method of the embodiment of the present invention. Referring to FIG. 26, the method can be described by the following steps:
  • Step 601 The acoustic model library of the terminal presets an acoustic model for implementing "character imitation".
  • the acoustic model library of the terminal is preset with various acoustic models (i.e., personalized acoustic models) for implementing "character imitation".
  • An acoustic model for "character imitation" can be used to make the synthesized speech carry the sound characteristics of a specific character; therefore, the preset timbre, preset intonation, and preset prosodic rhythm of a "character imitation" acoustic model will differ from the corresponding information of the general acoustic model.
  • The character imitated by a "character imitation" acoustic model may be a character image the user likes, a character in a film or television work, or a combination of multiple preset sound modes and user preferences.
  • A "character imitation" acoustic model can be an acoustic model that imitates the user's own speaking style, or an acoustic model that imitates the speaking characteristics of other characters:
  • for example, an acoustic model used to imitate "Lin Zhiling / soft voice", an acoustic model used to imitate "Little Shenyang / funny voice", an acoustic model used to imitate "Andy Lau / deep voice", and so on.
  • In a possible embodiment, the terminal does not select a single specific acoustic model from the acoustic model library, but rather a comprehensive model formed from multiple acoustic models in the acoustic model library.
  • That is, in addition to acoustic models that preset the sound characteristics of specific characters, the acoustic model library allows different voice characteristics and different language style characteristics to be combined according to user preferences or needs, so as to form an acoustic model with individualized characteristics.
  • Here, the voice characteristics include speech rate (sound speed), intonation, prosodic rhythm, timbre, and so on.
  • Differences in timbre arise because, in addition to a fundamental tone, a sound naturally contains many different frequencies interwoven as overtones; these determine different timbres, so that people can distinguish different voices when listening.
  • the characters represented by these different sounds can be natural persons (such as users, sound models, etc.), or they can be animated characters or virtual characters (such as robot cats, Luo Tianyi, etc.).
  • Language style features include mantras (including habitual mood words), responses to specific scenarios, type of intelligence, personality type, popular expressions or dialects mixed into speech, and the forms of address used for specific people. That is to say, for an acoustic model that combines different voice characteristics and different language style characteristics according to the user's preferences or needs, the preset information includes, in addition to two or more of the preset sound speed, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, the language style features as well.
  • A user's mantra consists of phrases the user habitually says, intentionally or not. For example, when surprised, some people add "Are you mistaken?" in front of a sentence, and some people often insert hedging words such as "maybe" or "perhaps" in the middle of sentences. In addition, a mantra may also include habitual mood words, such as the iconic mood word that the comedian Xiao Shenyang often appends to the end of his sentences.
  • A response to a specific scenario refers to the reply a person most commonly gives in a specific scenario or to a specific question. For example, to a question like "Where shall we eat?", a person's habitual response may be "Anywhere is fine"; to a question like "What beer do you want?", a person's habitual response may be "Tsingtao beer", and so on.
  • The type of intelligence is used to distinguish the tendencies of different groups of people toward different ways of presenting content.
  • The types of intelligence further include the following: the linguistic type, where people have strong reading ability, like reading text and playing word games, and are good at writing poems or stories; the logical-mathematical type, where people are rational, good at calculation, and sensitive to numbers; the musical type, where people are sensitive to melody and sound, like music, and learn more efficiently with music in the background; the spatial type, where people are sensitive to their surroundings, like reading charts, and are good at drawing; the bodily-kinesthetic type, where people are good at using their bodies and like sports and hands-on making; the interpersonal type, where people are good at understanding and communicating with others; the intrapersonal (introspective) type, where people like to think independently and set their own goals; and the naturalist type, where people are interested in the natural creatures of the planet.
  • Personality types refer to the different language styles of people with different personalities. For example, a person with a steady personality has a rigorous language style; a person with a lively personality has a humorous language style; a person with an introverted personality has a euphemistic language style; and so on.
  • Mixing dialects into speech means that a person likes to mix native dialect or foreign-language expressions into their speech; for example, when expressing thanks, someone may prefer a Cantonese expression or the English "Thank you".
  • Including popular expressions in speech means that a person likes to use currently popular words or Internet slang in place of particular words; for example, when sad, a person may say "blue thin mushroom" (Internet slang) instead of "uncomfortable".
  • The form of address for a specific person refers to using a particular appellation for that person; for example, the user may call a specific person surnamed Wang "Mr. Wang" or "Lao Wang", and so on.
  • the voice response system of the terminal can obtain the voice characteristics and language style characteristics associated with the user identity through learning.
  • For example, user preferences can be acquired and analyzed in advance through feature transfer; that is, user needs can be determined from the information the user consumes in other dimensions, so as to further infer the voice features and language style features the user is likely to prefer.
  • For instance, the characteristics of the songs the user likes can be analyzed and aggregated: the speech rate and rhythmic strength of the synthesized speech are determined according to the rhythmic strength of the songs; the timbre characteristics of the synthesized speech are determined according to the voice characteristics of the corresponding singers; and the language style features of the synthesized speech are determined according to the style of the lyrics.
  • Similarly, the features of the user's favorite TV programs, social media content, and other dimensions can be analyzed and aggregated to train a feature transfer model, which can then be used to infer the voice features and language style features the user prefers.
  • The terminal's voice response system can also obtain and analyze user preferences through multi-modal information; that is, by analyzing the user's expressions, attention level, and operating behavior, it automatically infers the user's preferences or needs regarding the synthesized speech features.
  • Through multi-modal analysis, the user's requirements for synthesized speech can be collected before personalized synthesized speech is generated, and the user's preference for the speech can also be continuously tracked after the personalized speech is generated, so that the features of the synthesized speech can be iteratively optimized based on this information.
  • For example, the user's preference for different voices can be obtained indirectly by analyzing the user's degree of attention when hearing different synthesized voices (the degree of attention can be obtained from the user's facial expression information, or from EEG or other bioelectric signals collected by the user's wearable device), or by analyzing the user's operating habits (for example, skipping a voice or fast-forwarding through it may indicate that the user does not like that voice much).
  • the following describes the acoustic model with the sound characteristics of a specific character and a comprehensive model (or fusion model) obtained by fusing multiple acoustic models.
  • Models can be fused; for example, an acoustic model that imitates "Lin Zhiling / soft voice" can be fused with an acoustic model that imitates "Little Shenyang / funny voice". As another example, the user's own voice characteristics and language style characteristics, or the voice characteristics and
  • language style characteristics of a character image the user likes, can be combined with the acoustic models corresponding to character images in film and television works (such as the "Lin Zhiling / soft voice" acoustic model and the "Little Shenyang / funny voice" acoustic model) to obtain
  • the final acoustic model used for subsequent speech synthesis.
  • For example, the multiple personalized acoustic models in the acoustic model library can respectively provide deep, soft, cute, funny, and other types of voices.
  • After acquiring the user's preferences or needs regarding voice (these preferences or needs are directly associated with the user's identity), the terminal determines the user's preference coefficient for each of these acoustic models, and these preference coefficients serve as the weights of the corresponding acoustic models.
  • the weight value of each acoustic model is manually set in advance by a user according to his own requirements, or the weight value of each acoustic model is automatically determined by the terminal in advance by learning user preferences. Then, the terminal may perform weighted superposition on the respective acoustic models based on the weight value, so as to obtain a comprehensive acoustic model by fusion.
  • Specifically, according to the voice characteristics and language style characteristics the user likes, the terminal may select the features of the one or several dimensions the user prefers most and match them against
  • the voices of the multiple acoustic models, so as to determine the user's preference coefficient for the voice of each acoustic model.
  • The sound characteristics of each acoustic model are then combined according to the corresponding preference coefficients to obtain the final voice scene parameters.
  • The table shown in FIG. 27 exemplarily gives the sound characteristics corresponding to various voice types (deep, soft, funny); it can be seen that different voice types have correspondingly different speech rates, intonations, prosodic rhythms, and timbres.
  • When obtaining the user's preferences or needs regarding voice, the terminal can also directly match the voices of the multiple acoustic models according to the user's identity (that is, the user's preferences or needs are directly tied to the user's identity), thereby determining, for example,
  • that the user's preference coefficients for several voice types such as deep, soft, and funny are 0.2, 0.8, and 0.5 respectively; that is, the weights of these acoustic models are 0.2, 0.8, and 0.5 respectively.
  • The final acoustic model (that is, the fusion model) can then be obtained by weighted superposition of the speech rate, intonation, prosodic rhythm, timbre, and other characteristics of these acoustic models.
  • The resulting voice scene parameters transform the sound of the acoustic model in terms of speech rate, intonation, prosodic rhythm, and timbre, which helps produce mixed sound effects such as a "talking Lin Zhiling" or a "rapper-mode Lin Zhiling".
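  • A minimal sketch of such weighted superposition, assuming each acoustic model exposes numeric sound characteristics; the feature values are invented, and normalizing by the total weight is an assumption rather than something the embodiment specifies:

```python
# Fuse the sound characteristics of several personalized acoustic models using
# the user's preference coefficients as weights (illustrative values only).
models = {
    "deep":  {"speech_rate": 0.9, "intonation": 0.8, "rhythm": 0.9, "timbre": 0.3},
    "soft":  {"speech_rate": 1.0, "intonation": 1.2, "rhythm": 1.0, "timbre": 0.8},
    "funny": {"speech_rate": 1.2, "intonation": 1.4, "rhythm": 1.3, "timbre": 0.6},
}
weights = {"deep": 0.2, "soft": 0.8, "funny": 0.5}   # preference coefficients

def fuse(models, weights):
    total = sum(weights.values())
    features = next(iter(models.values())).keys()
    return {
        f: sum(weights[name] * models[name][f] for name in models) / total
        for f in features                            # normalized weighted average
    }

print(fuse(models, weights))   # parameters of the fused ("comprehensive") model
```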
  • the embodiment of the present invention is not limited to using the above-mentioned method to obtain a comprehensive model of multiple acoustic models (abbreviated as a fusion model).
  • In a possible embodiment, the user can also manually select parameters, or make a text or voice request, to form the final acoustic model.
  • For example, the terminal may provide a graphical user interface or a voice interaction interface, and the user may select the parameters of each voice feature and each language style feature according to his or her preference. FIG. 28 shows such a selection interface for voice feature parameters and language style feature parameters.
  • Suppose the user selects the voice features corresponding to the "Lin Zhiling" acoustic model; then the parameter values of sub-parameters such as speech rate, intonation, prosodic rhythm, and timbre of the "Lin Zhiling" acoustic model
  • are used as the parameter values of the corresponding sub-parameters of the fusion model's voice features.
  • Similarly, suppose the user selects the language style features corresponding to the "Little Shenyang" acoustic model;
  • then the parameter values of the fusion model's language style sub-parameters are taken from the language style features of the "Little Shenyang" acoustic model.
  • In another example, the user may send a text or voice request to the terminal in advance, such as "Please speak with Lin Zhiling's voice in Xiao Shenyang's language style". The terminal's voice response system then parses the user's request and sets the speech rate, intonation, prosodic rhythm, and timbre of the fusion model's voice features to the corresponding sub-parameter values of the "Lin Zhiling" acoustic model, and sets the mantra, responses to specific scenarios, type of intelligence, personality type, and dialect/popular expression sub-parameters of the fusion model's language style features to the corresponding values of the "Little Shenyang" acoustic model.
  • In a possible embodiment, the terminal may also determine the acoustic model preferred by the user according to the user's identity, so that during speech synthesis the terminal can directly select that preferred acoustic model from the acoustic model library.
  • the acoustic model preferred by the user is not necessarily the personalized acoustic model originally set in the acoustic model library, but may be an acoustic model obtained by fine-tuning parameters of a personalized acoustic model according to the preference of the user.
  • For example, the sound characteristics of a personalized acoustic model originally set in the acoustic model library include a first speech rate (sound speed), a first intonation, a first prosodic rhythm, and a first timbre.
  • Through analysis of user preferences or through the user's manual settings, the terminal determines the user's preferred parameter combination, for example: 0.8 times the first speech rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm, and a 1.2-fold feminization of the first timbre; these parameters are then adjusted accordingly to obtain a personalized acoustic model that meets the user's needs.
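  • Sketched as code, this fine-tuning is simply an element-wise scaling of the original model's characteristics by the user's preferred multipliers (the parameter names and base values are illustrative):

```python
# Scale the original personalized acoustic model's characteristics by the
# multipliers determined from user preferences or manual settings.
base_model = {"speech_rate": 1.0, "intonation": 1.0, "rhythm": 1.0, "timbre": 1.0}
multipliers = {"speech_rate": 0.8, "intonation": 1.3, "rhythm": 0.9, "timbre": 1.2}

tuned_model = {k: base_model[k] * multipliers[k] for k in base_model}
print(tuned_model)   # acoustic model parameters adjusted to the user's preference
```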
  • Step 602 The terminal determines, through the input voice of the user, that the current dialogue needs to adopt an acoustic model of "character imitation.”
  • Specifically, the terminal may determine, through the DM module, that the scene of the current dialogue needs to be set to "character imitation".
  • the manner in which the DM module determines the current scene as a voice scene of "character imitation” may include the following:
  • the user's input voice contains the user's intention to clearly indicate that the current dialogue is "character imitation”.
  • In this case, the DM module works with the intent recognition module to determine the user's intention, and then determines that the current dialogue is a "character imitation" scene. For example, if the user inputs a voice instructing the terminal to speak with Lin Zhiling's voice, the terminal automatically sets the current dialogue scene to a "character imitation" scene after recognizing the user's intention.
  • Alternatively, the terminal can determine through the DM module whether the content of the input text corresponding to the user's input voice involves content related to character imitation.
  • Such content can be detected through methods such as full-text matching, keyword matching, and semantic similarity matching; the content includes lyrics, sound effects, movie lines, animation dialogue scripts, and the like.
  • the method of full-text matching means that the input text is exactly the same as a part of the corresponding movie or music work
  • the method of keyword matching means that the input text is the same as a part of keywords of a movie or music work
  • The method of semantic similarity matching means that the input text is semantically similar to a part of a film or music work.
  • For example, suppose the input text is "He has always been the protagonist. He said that daydreaming is not wrong, and that people without dreams are just salted fish. On this road of fighting for dreams, as long as I work hard I will gain something; that is enough."
  • Through matching, this input text is found to correspond to matching content,
  • and the matching content is a line from "Shaolin Football".
  • Step 603 The terminal obtains an acoustic model corresponding to "character imitation" from an acoustic model library.
  • the terminal may select a certain acoustic model or a certain fusion model from the acoustic model library according to user preference.
  • In a possible implementation, the terminal determines a sound mode identifier related to the content of the current input voice according to that content, and selects from the acoustic model library the acoustic model corresponding to the sound mode identifier.
  • For example, the terminal may determine, according to the input text, user preference, or reply text, that the currently synthesized speech needs to use a "Chou Xingchi" type of voice, and then select the acoustic model of the "Chou Xingchi" voice from the acoustic model library.
  • In another possible implementation, the terminal determines a weight value (that is, a preference coefficient) for each acoustic model among a plurality of acoustic models, where the weight value of each acoustic model is set in advance by the user or determined in advance according to the user's preference; the acoustic models are then
  • fused based on these weight values to obtain a fusion acoustic model.
  • Step 604 The terminal performs subsequent speech synthesis by using the selected acoustic model.
  • For example, the speech the terminal would originally have synthesized is "Let's eat at XX tonight".
  • After the terminal applies the selected fusion model of the "Lin Zhiling" acoustic model and the "Little Shenyang" acoustic model, the final synthesized speech becomes, for example, "Do you know? Let's eat at XX place tonight, eh".
  • the speech features in the output speech use the relevant parameters of the "Lin Zhiling" acoustic model, thus reflecting the soft features of the synthesized speech.
  • the language style features in the output speech use the relevant parameters of the "Little Shenyang” acoustic model, thus reflecting the witty and funny characteristics of the synthesized speech.
  • The synthesized speech output in this way achieves the effect of "speaking in Xiao Shenyang's language style with Lin Zhiling's voice".
  • Further, a prosodic rhythm template similar to that shown in FIG. 23 may also be applied, which not only completes the real-time voice interaction with the user, but also meets the user's personalized needs and improves the user experience.
  • a background sound effect may also be superimposed when outputting the synthesized speech.
  • the scenario of superimposing "background sound" on the synthesized speech is used as an example to describe the speech synthesis method of the embodiment of the present invention. Referring to FIG. 29, the method can be described by the following steps:
  • Step 701 The terminal presets a music library.
  • a music library is preset in the TTS parameter library of the terminal, and the music library includes multiple music files, and these music files are used to provide a background sound effect during the speech synthesis process.
  • A background sound effect specifically refers to a piece of music (such as pure music or a song) or a sound effect (such as film and television sound effects, game sound effects, speech sound effects, animation sound effects, and so on).
  • Step 702 The terminal determines that the reply text has content suitable for superimposing background music.
  • the terminal may determine, through the DM module, content suitable for superimposing background music.
  • Content suitable for superimposing background music can be words with emotional polarity, poetry, film and television lines, and so on.
  • the terminal can identify the sentiment-oriented words in the sentence through the DM module, and then determine the emotional state of the phrase, sentence, or the entire reply text through methods such as grammatical rule analysis and machine learning classification.
  • the emotional dictionary can be used to identify these emotionally inclined words.
  • the emotional dictionary is a collection of words, and the words in the collection have obvious emotional polarity tendencies, and the emotional dictionary also contains the polarity information of these words.
  • For example, the words in the dictionary are labeled with the following emotional polarity types: happiness, liking, sadness, surprise, anger, fear, disgust, and so on.
  • Each emotional polarity can be further divided into multiple levels of emotional intensity (for example, five levels).
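  • A toy sketch of such an emotion dictionary and lookup; the entries, the intensity scale, and the "strongest hit wins" aggregation rule are invented for illustration:

```python
# Each dictionary entry carries an emotional polarity type and an intensity (1-5).
EMOTION_DICT = {
    "happy":  ("happiness", 3),
    "won":    ("happiness", 2),
    "boring": ("sadness",   2),
    "afraid": ("fear",      4),
}

def detect_emotion(words):
    """Return the dominant (polarity, intensity) among the emotion words found."""
    hits = [EMOTION_DICT[w] for w in words if w in EMOTION_DICT]
    if not hits:
        return ("neutral", 0)
    return max(hits, key=lambda h: h[1])

print(detect_emotion("the national football team won again so happy".split()))
```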
  • Step 703 The terminal determines a background sound effect to be superimposed from the music library.
  • the terminal determines the background sound effect to be superimposed in the TTS parameter database through the PM module.
  • Specifically, the terminal labels different segments (that is, sub-segments) of each music file in the music library with emotional polarity categories in advance; for example, these segments are labeled with emotional polarity types such as happiness, liking, sadness, surprise, anger, fear, and disgust. Assuming the current reply text includes text with emotional polarity, after determining the emotional polarity categories of that text in step 702, the terminal searches the music library through the PM module for a music file labeled with the corresponding emotional polarity category.
  • Further, if both the emotional polarity category and the emotional intensity are labeled for each sub-segment in the music library in advance, then after the emotional polarity category and emotional intensity of the text are determined in step 702, a combination of sub-segments carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
  • If the reply text contains poem/ci/qu content, the terminal searches the music library through the PM module for pure music, songs, or music effects related to that content; if found, the music or song is used as the background sound effect to be superimposed.
  • Alternatively, if the emotional polarity category is labeled for each background sound effect in the music library beforehand, then after the emotional polarity category of the poem/ci/qu content in the reply text is determined, the background sound effect labeled with the corresponding emotional polarity category can be found in the music library.
  • Similarly, if both an emotional polarity category and an emotional intensity label are set in advance for each background sound effect in the music library, then after the emotional polarity category and emotional intensity of the poem/ci/qu content are determined, the background sound effect carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library.
  • If the reply speech is to be synthesized with a character-imitation acoustic model, the terminal can also search the music library through the PM module for pure music, songs, or music effects related to the imitated character.
  • For example, if the imitated character corresponds to the "Little Shenyang" voice mode, a song clip can be selected from his songs, according to the dialogue scene or the content of the reply text, as the final background sound effect.
  • Step 704 The terminal aligns the reply text with the determined background sound effect to facilitate subsequent speech synthesis.
  • Specifically, the terminal can split the content of the reply text that needs background sound effects into different parts (split according to punctuation or word segmentation), each of which can be called a sub-content, and determine the emotional polarity type and emotional intensity of each sub-content. Then, after the background sound effect matching the content is determined, the content is aligned with the matched background sound effect, so that the emotional variation of the content is basically consistent with the emotional variation of the background sound effect.
  • the reply text is "The weather is good, the national football team has won again, so happy.”
  • the entire content of the reply text needs to be superimposed with background sound effects.
  • The reply text is split into three sub-contents: "The weather is good", "the national football team has won again", and "so happy". The emotional polarity category of each part is happiness, the emotional intensities are 0.48, 0.60, and 0.55 respectively (indicated by the black dots in the lower half of the figure), and the total pronunciation durations of the parts are 0.3s, 0.5s, and 0.2s respectively.
  • In step 703, a music file whose emotional polarity category is happiness has been preliminarily determined.
  • The emotional variation trajectory of the music file can then be computed to obtain the emotional intensity of each part of the music.
  • the waveform shown in Figure 30 represents a piece of music.
  • The music can be divided into 15 small segments, each with a duration of 0.1s. According to parameters such as the sound intensity and rhythm of each small segment, the emotional intensity of each segment is calculated through fixed rules or classifiers.
  • The emotional intensities of these 15 small segments are: 0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43, 0.52, 0.33, 0.45, 0.53, 0.44, 0.42, 0.41 (indicated by the black dots in the upper half of the figure).
  • Among them, a sub-segment composed of three consecutive small segments (for example, segments 4 to 6) has a total sound length of 0.3s,
  • and its maximum emotional intensity is 0.51 (from the emotional intensity 0.51 of the fourth segment);
  • the sub-segment composed of segments 7, 8, 9, 10, and 11 has a total sound length of 0.5s, and its maximum emotional intensity is 0.63 (from the emotional intensity 0.63 of the seventh segment);
  • a sub-segment composed of two further small segments has a total sound length of 0.2s, and its maximum emotional intensity is 0.53 (from the segment whose emotional intensity is 0.53).
  • The emotional variations of these three sub-segments are basically consistent with the emotional variations of the three sub-contents of the reply text (for example, the two polylines in the figure have essentially the same trajectory), so the music segment composed of the three sub-segments is the background sound effect that matches the reply text. The three sub-segments can therefore be aligned with "The weather is good", "the national football team has won again", and "so happy" in the reply text, so as to produce the "speech with superimposed background sound" effect in the subsequent speech synthesis.
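  • A possible sketch of this alignment, using the durations and intensity values from the example above; the greedy left-to-right window search is only one assumed realization, not necessarily how the embodiment selects sub-segments:

```python
seg_len = 0.1                                   # each music segment lasts 0.1 s
music_intensity = [0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43,
                   0.52, 0.33, 0.45, 0.53, 0.44, 0.42, 0.41]
sub_contents = [("The weather is good", 0.3, 0.48),
                ("the national football team has won again", 0.5, 0.60),
                ("so happy", 0.2, 0.55)]        # (text, duration s, emotional intensity)

def best_start(n_segs, target, not_before):
    """Among windows of n_segs segments starting at or after `not_before`, pick
    the one whose peak intensity is closest to the sub-content's intensity."""
    candidates = range(not_before, len(music_intensity) - n_segs + 1)
    return min(candidates,
               key=lambda s: abs(max(music_intensity[s:s + n_segs]) - target))

cursor, plan = 0, []
for text, dur, intensity in sub_contents:
    n = round(dur / seg_len)                    # number of music segments needed
    start = best_start(n, intensity, cursor)
    plan.append((text, start + 1, start + n))   # 1-based segment indices, inclusive
    cursor = start + n                          # keep the sub-segments in order
print(plan)  # e.g. aligns the three sub-contents with segments 4-6, 7-11, 12-13
```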
  • FIG. 31 is a schematic structural diagram of a speech synthesis device 200 according to an embodiment of the present invention.
  • the device 200 may include one or more processors 2011, one or more memories 2012, and an audio circuit 2013.
  • the device 200 may further include an input unit 2016, a display unit 2019, and other components.
  • the processor 2011 may be connected to the memory 2012, the audio circuit 2013, the input unit 2016, and the display unit 2019 through a bus, respectively. They are described as follows:
  • the processor 2011 is a control center of the device 200, and uses various interfaces and lines to connect various components of the device 200.
  • the processor 2011 may further include one or more processing cores.
  • The processor 2011 may perform speech synthesis by running or executing the software programs (instructions) and/or modules stored in the memory 2012 and calling the data stored in the memory 2012 (for example, executing the functions of the modules in the embodiments of FIG. 4 or FIG. 9 and processing their data), so as to enable real-time voice conversation between the device 200 and the user.
  • The memory 2012 may include a high-speed random access memory, and may further include a non-volatile memory such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 2012 may further include a memory controller to provide the processor 2011 and the input unit 2016 with access to the memory 2012.
  • the memory 2012 may be specifically used to store software programs (instructions) and data (relevant data in the acoustic model library, relevant data in the TTS parameter library).
  • the audio circuit 2013 may provide an audio interface between the device 200 and a user, and the audio circuit 2013 may further be connected with a speaker 2014 and a microphone 2015.
  • the microphone 2015 can collect the user's sound signals and convert them into electrical signals, which are received by the audio circuit 2013 and converted into audio data (that is, forming the user's input voice); the audio data is then transmitted to the processor 2011 for voice processing.
  • the processor 2011 synthesizes the reply voice based on the user's input voice and transmits it to the audio circuit 2013.
  • the audio circuit 2013 can convert the received audio data (that is, the reply voice) into an electrical signal, which is further transmitted to the speaker 2014 and converted by the speaker 2014 into a sound signal for output, so that the reply voice is presented to the user, thereby achieving real-time voice conversation between the device 200 and the user.
  • the input unit 2016 may be used to receive digital or character information input by a user, and generate a keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • Specifically, the input unit 2016 may include a touch-sensitive surface 2017 and other input devices 2018.
  • the touch-sensitive surface 2017 is also referred to as a touch display screen or a touchpad, which can collect user's touch operations on or near it and drive the corresponding connection device according to a preset program.
  • other input devices 2018 may include, but are not limited to, one or more of a physical keyboard, function keys, trackball, mouse, joystick, and the like.
  • the display unit 2019 can be used to display information input by the user or information provided by the device 200 to the user (such as identifiers or text related to the reply speech) and the various graphical user interfaces of the device 200; these graphical user interfaces can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 2019 may include a display panel 2020.
  • the display panel 2020 may be configured by using a liquid crystal display (Liquid Crystal Display, LCD), an organic light emitting diode (Organic Light-Emitting Diode, OLED), and the like.
  • Although the touch-sensitive surface 2017 and the display panel 2020 are shown as two separate components, in some embodiments the touch-sensitive surface 2017 and the display panel 2020 may be integrated to implement the input and output functions.
  • the touch-sensitive surface 2017 may cover the display panel 2020.
  • When the touch-sensitive surface 2017 detects a touch operation on or near it, the operation is transmitted to the processor 2011 to determine the type of touch event, and the processor 2011 then provides a corresponding visual output on the display panel 2020.
  • the device 200 in the embodiment of the present invention may include more or fewer components than shown, or some components may be combined, or different components may be arranged.
  • the device 200 may further include a communication module, a camera, and the like, and details are not described herein again.
  • the processor 2011 may implement the speech synthesis method of the embodiment of the present invention by running or executing a software program (instructions) stored in the memory 2012 and calling data stored in the memory 2012, including: determining the identity of the user according to the user's current input voice; obtaining an acoustic model from the acoustic model library according to the current input voice, where the preset information of the acoustic model includes two or more of a preset sound speed, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determining basic speech synthesis information from the speech synthesis parameter database according to the identity of the user, the basic speech synthesis information including the amount of change of one or more of the preset sound speed, the preset volume, and the preset pitch; determining a reply text based on the current input voice; determining enhanced speech synthesis information from the speech synthesis parameter database based on the reply text and context information, the enhanced speech synthesis information including the amount of change of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information, so as to obtain the voice used to reply to the user.
  • the memory 2012 may further be used to store these software modules, and the processor 2011 may run or execute the software programs (instructions) and/or these software modules in the memory 2012 and call the data stored in the memory 2012 to perform speech synthesis.
  • It should be noted that FIG. 31 shows only one implementation of the speech synthesis device of the present invention; in a possible embodiment, the processor 2011 and the memory 2012 in the device 200 may be deployed in an integrated manner.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media.
  • the usable medium may be a magnetic medium (such as a floppy disk, a hard disk, a magnetic tape, etc.), an optical medium (such as a DVD, etc.), or a semiconductor medium (such as a solid state hard disk), and the like.

Abstract

A speech synthesis method and a related device. The method comprises: determining a user identity according to a current input speech of the user; acquiring an acoustic model from an acoustic model library (1033) according to the current input speech; determining basic speech synthesis information according to the user identity, the basic speech synthesis information representing variations in a preset sound speed, a preset volume, and preset pitch of the acoustic model; determining a reply text; determining enhanced speech synthesis information according to the reply text and contextual information, the enhanced speech synthesis information representing variations in a preset timbre, an intonation, and a preset rhythm of the acoustic model; and using the acoustic model to perform speech synthesis on the reply text according to the basic speech synthesis information and the enhanced speech synthesis information so as to acquire a speech used as a reply to the user. The method enables a device to provide a personalized speech synthesis effect for a user in a process of human-computer interaction, thereby improving the speech interaction experience of the user.

Description

Speech synthesis method and related device
Technical field
The present invention relates to the field of speech processing, and in particular, to a speech synthesis method and a related device.
Background
In recent years, human-computer dialogue has entered people's daily lives on a large scale; common scenarios include intelligent customer-service robots, smart speakers, and chatbots. The core of human-computer dialogue is that, within the framework of the built system and based on previously trained or learned data, the machine can automatically understand and analyze the speech input by the user and give a meaningful spoken reply. When designing a speech synthesis system for Chinese text, if the input characters are simply matched one by one against a pronunciation library and the pronunciations of all the characters are concatenated to form the speech output, the resulting speech sounds mechanical and stiff, without rise and fall of intonation, and the listening experience is poor. The TTS (text-to-speech) engine developed in recent years is a speech synthesis technology built on reading rules; using a TTS engine for speech synthesis handles the transitions between individual characters/words and shifts of tone relatively naturally, making the machine's spoken replies much closer to a human voice.
However, the prior art is limited to making the machine "sound like a human" during human-computer interaction, and does not consider users' diverse needs for human-computer interaction.
Summary of the invention
Embodiments of the present invention provide a speech synthesis method and a related device, so that during human-computer interaction a machine can provide a personalized speech synthesis effect for a user according to the user's preferences or the requirements of the dialogue environment, improving the timeliness of human-computer dialogue and enhancing the user's voice interaction experience.
In a first aspect, an embodiment of the present invention provides a speech synthesis method, which can be applied to a terminal device, including: the terminal device receives a user's current input speech and determines the user's identity according to the current input speech; obtains an acoustic model from an acoustic model library preset in the terminal device according to the current input speech, where preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; the terminal device determines basic speech synthesis information according to the user's identity, where the user's identity is associated with corresponding basic speech synthesis information; in the embodiments of the present invention, the basic speech synthesis information may also be called basic TTS parameters, and the basic TTS parameters are used to represent variations of one or more of the preset speech rate, the preset volume, and the preset pitch of the acoustic model used in speech synthesis; determines a reply text according to the current input speech; the terminal device determines enhanced speech synthesis information according to the reply text, or according to the reply text and context information; in the embodiments of the present invention, the enhanced speech synthesis information may also be called enhanced TTS parameters, and the enhanced TTS parameters are used to represent variations of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm of the acoustic model used in speech synthesis; in the embodiments of the present invention, the terminal device can determine the dialogue scenario of the current dialogue according to the reply text, or according to the reply text and the context information of the current input speech; the terminal device then performs speech synthesis on the reply text through the acoustic model (including the preset information of the acoustic model) according to the basic speech synthesis information and the enhanced speech synthesis information, obtaining the reply speech to be presented to the user, thereby realizing real-time dialogue interaction between the terminal device and the user. That is, in the embodiments of the present invention, the acoustic model can convert the reply text into reply speech according to the preset information of the acoustic model and the variation information of the preset information.
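For illustration only, the following minimal sketch outlines this flow; the dictionaries, function name, and parameter names are assumptions rather than the claimed implementation, and the actual identity recognition, dialogue management, and neural acoustic model are omitted.

```python
# Minimal, self-contained sketch of the described flow (illustrative only).
BASIC_TTS = {"alice": {"rate": +0.1, "volume": 0.0, "pitch": -0.05}}          # per-user deltas
ENHANCED_TTS = {"poetry": {"timbre": 0.0, "intonation": +0.2, "rhythm": +0.3}}  # per-scene deltas

def synthesize_reply(user_id: str, reply_text: str, scene: str) -> dict:
    basic = BASIC_TTS.get(user_id, {"rate": 0.0, "volume": 0.0, "pitch": 0.0})
    enhanced = ENHANCED_TTS.get(scene, {"timbre": 0.0, "intonation": 0.0, "rhythm": 0.0})
    # A real system would feed these deltas, together with the acoustic model's
    # preset values, into the neural acoustic model; here we only return them.
    return {"text": reply_text, "basic": basic, "enhanced": enhanced}

print(synthesize_reply("alice", "Nice to meet you", "poetry"))
```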
Optionally, the acoustic model library may include multiple acoustic models (for example, a general-purpose acoustic model, personalized acoustic models, and so on). These acoustic models are all neural network models, which may be trained in advance on different corpora. Each acoustic model has its own preset information; in other words, each acoustic model is bound to a specific piece of preset information, which can serve as basic input information for that acoustic model.
Optionally, because a user's identity can also be associated with the user's personal preferences, the terminal may also determine the basic speech synthesis information according to the user's personal preferences.
In the embodiments of the present invention, the context information may represent the context of the current input speech or the historical input speech preceding the current input speech.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, in the human-computer voice interaction between a user and a terminal device, the terminal device on the one hand generates a corresponding reply text according to the user's input speech, and on the other hand can select personalized TTS parameters (including basic TTS parameters and enhanced TTS parameters) based on the reply text of the dialogue interaction and the dialogue context information, in combination with the current user's identity, preferences, and dialogue scenario. The terminal device can then use these personalized TTS parameters and the selected acoustic model to generate reply speech in a specific style, thereby presenting a personalized speech synthesis effect to the user, greatly improving the user's voice interaction experience with the terminal, and improving the timeliness of human-computer dialogue.
Based on the first aspect, in a possible implementation, the terminal device also allows the user to tune the terminal device in real time through speech and update the TTS parameters associated with the user's identity and preferences, including updating the basic TTS parameters and the enhanced TTS parameters, so that the tuned terminal is closer to the user's interaction preferences and the user's interaction experience is maximized.
Based on the first aspect, in a possible implementation, the enhanced TTS parameters may be further classified into speech emotion parameters, speech scene parameters, and the like. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics; depending on the emotional characteristics, the speech emotion parameters may be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness. The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics; depending on the scene characteristics, the speech scene parameters may be further divided into parameters such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcasting. In other words, using these speech scene parameters in speech synthesis enables the synthesized speech to present the sound effects of voice scenes such as daily dialogue, poetry recitation, song humming, storytelling, and news broadcasting.
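As a purely illustrative sketch of how such emotion and scene parameters might be organized, all names and values below are assumptions and are not taken from the embodiments.

```python
# Illustrative grouping of enhanced TTS parameters into emotion and scene parameters.
EMOTION_PARAMS = {
    "neutral":      {"pitch_shift": 0.00, "rate_shift": 0.00},
    "mildly_happy": {"pitch_shift": +0.05, "rate_shift": +0.05},
    "very_happy":   {"pitch_shift": +0.15, "rate_shift": +0.10},
    "mildly_sad":   {"pitch_shift": -0.05, "rate_shift": -0.05},
}
SCENE_PARAMS = {
    "daily_dialogue": {"rhythm_boost": 0.0},
    "poetry_recital": {"rhythm_boost": 0.4},   # stronger pauses and stress
    "song_humming":   {"rhythm_boost": 0.6},
    "storytelling":   {"rhythm_boost": 0.2},
    "news_broadcast": {"rhythm_boost": 0.1},
}
print(EMOTION_PARAMS["mildly_happy"], SCENE_PARAMS["poetry_recital"])
```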
The following takes "poetry recitation" as an example to describe an implementation in which speech scene parameters related to "poetry recitation" are used in speech synthesis.
In the embodiments of the present invention, the manner of determining that the current dialogue is a "poetry recitation" voice scene may include:
(1) During the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "poetry recitation" voice scene.
(2) In an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "poetry recitation", the terminal device can still determine whether the content of the reply text involves one or more specific literary styles such as shi poems, ci poems, qu verses, or fu prose, for example five-character or seven-character quatrains, regulated verse, or a specific ci or qu tune pattern.
(3) The terminal device stores in advance literary-style features, such as the number of characters, the number of sentences, and the order of the number of characters per sentence, corresponding to various literary styles (or syntactic formats). By analyzing features of the reply text such as punctuation (pauses), the number of characters, the number of sentences, and the order of the number of characters per sentence, it matches a passage or the whole of the reply text against the pre-stored literary-style features; if the match succeeds, the passage or the whole text that fits the pre-stored literary-style features can be taken as the text for which the "poetry recitation" voice scene is used (a simple matching sketch follows this list).
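A minimal sketch of detection method (3), assuming the literary-style features are stored as per-line character counts; the pattern names and contents are illustrative only.

```python
import re

# Match punctuation-delimited clause lengths against stored literary-style patterns.
STYLE_PATTERNS = {
    "five_char_quatrain":  [5, 5, 5, 5],
    "seven_char_quatrain": [7, 7, 7, 7],
}

def detect_literary_style(text: str):
    clauses = [s.strip() for s in re.split(r"[，。！？,.!?\n]", text) if s.strip()]
    lengths = [len(s) for s in clauses]
    for name, pattern in STYLE_PATTERNS.items():
        if lengths == pattern:
            return name
    return None

print(detect_literary_style("床前明月光，疑是地上霜。举头望明月，低头思故乡。"))
# -> five_char_quatrain
```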
In the embodiments of the present invention, the "poetry recitation" voice scene emphasizes the prosodic rhythm of the speech. The speech scene parameters for "poetry recitation" are used to adjust, for input text that conforms to a specific literary style (or syntactic format), the positions and durations of pauses in the speech (i.e., the segmentation of the text content), the reading duration of individual characters or words, and the positions of stress, so as to strengthen the prosodic rhythm. Compared with the natural prosodic rhythm of an ordinary dialogue, the strengthened prosodic rhythm expresses emotion more clearly and strongly; for example, when reciting texts with specific syntactic formats such as poems or the parallel lines of nursery rhymes, the strengthened prosodic rhythm can produce a cadenced, rising-and-falling feeling.
In a specific implementation, the speech scene parameters for "poetry recitation" may be realized by prosodic rhythm templates; the text content of each specific literary style may correspond to one prosodic rhythm template. The literary style characterizes the genre of the poem or verse, for example ancient-style poetry, modern-style poetry (such as five-character or seven-character quatrains), regulated verse (such as five-character or seven-character regulated verse), ci poems (such as short, medium, and long ci), and qu verses (including various tunes and tune patterns). Each prosodic rhythm template defines the volume variation of the character at each position in the template (i.e., how heavily the character is stressed), the variation of its duration (i.e., how long the character is pronounced), the positions and durations of pauses in the speech of the text (i.e., the segmentation of the text content), and so on.
Specifically, in a possible implementation, when the terminal determines, according to the reply text and the context information, that the current dialogue is in the "poetry recitation" voice scene, the process in which the terminal determines the enhanced speech synthesis information according to the reply text and the context information specifically includes: determining literary-style features of the reply text by analyzing the reply text, where the literary-style features include one or more of the number of sentences, the number of characters per sentence, and the arrangement order of the numbers of characters per sentence for part or all of the content of the reply text; and selecting, according to the literary-style features involved in the reply text, a corresponding variation of the preset prosodic rhythm. The variation of the preset prosodic rhythm is the prosodic rhythm template, and there is a correspondence between the literary-style features and the prosodic rhythm templates.
In the "poetry recitation" voice scene of the specific embodiments of the present invention, the terminal aligns the content of the reply text with the prosodic rhythm template so as to facilitate subsequent speech synthesis. Specifically, when speech synthesis is required, the terminal may align the relevant content of the reply text with the prosodic rhythm template of the "poetry recitation" voice scene. Specifically, the terminal may combine the pronunciations of the relevant content of the reply text in the acoustic model library with the parameters of the prosodic rhythm template, and superimpose the parameters of the prosodic rhythm template onto these pronunciation segments according to a certain scale.
For example, in an exemplary embodiment, the prosody enhancement parameter is ρ (0 < ρ < 1), and the preset volume of the i-th character in the text content is Vi. If the prosodic rhythm features of this character include a stress feature whose stress variation is E1, the final volume of the character is Vi × (1 + E1) × (1 + ρ). For another example, if the basic duration of the i-th character in the text is Di and the variation of the duration is E2, the final duration of the character is Di × (1 + E2). For yet another example, a pause is required between the i-th character and the (i + 1)-th character, and the pause duration changes from 0 s to 0.02 s.
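These example formulas can be transcribed directly; the following sketch only reproduces the arithmetic above with illustrative values.

```python
# Worked example of the prosody adjustments described above (values illustrative).
def stressed_volume(preset_volume: float, stress_delta: float, rho: float) -> float:
    # Final volume of a stressed character: Vi * (1 + E1) * (1 + rho), with 0 < rho < 1.
    return preset_volume * (1 + stress_delta) * (1 + rho)

def adjusted_duration(preset_duration: float, duration_delta: float) -> float:
    # Final duration of a character: Di * (1 + E2).
    return preset_duration * (1 + duration_delta)

print(stressed_volume(0.8, 0.2, 0.3))   # Vi=0.8, E1=0.2, rho=0.3 -> about 1.248
print(adjusted_duration(0.25, 0.1))     # Di=0.25 s, E2=0.1 -> about 0.275 s
```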
Based on the first aspect, in a possible implementation, the acoustic model library may include a general-purpose acoustic model and several personalized acoustic models, where:
The preset information of the general-purpose acoustic model may include the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and so on; speech synthesized by the general-purpose acoustic model presents the sound effect of a normal, general dialogue scenario.
The preset information of a personalized acoustic model may include voice features and language style features. That is, in addition to two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, the preset information of a personalized acoustic model may also include other personalized information, for example one or more language style features such as catchphrases, ways of responding to particular scenarios, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for particular people. Speech synthesized by a personalized acoustic model can present the sound effect of a "character imitation" dialogue scenario.
It should be understood that the preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information of different acoustic models also differ from one another; for example, the preset information of a personalized acoustic model may differ noticeably from that of the general-purpose acoustic model.
The following takes "character imitation" as an example to describe an implementation in which an acoustic model related to "character imitation" is used in speech synthesis.
In the embodiments of the present invention, the terminal device may determine from the user's input speech that the current dialogue needs to use a "character imitation" acoustic model, which specifically includes several ways:
(1) During the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "character imitation" scenario; after determining the user intent, the terminal device determines that the current dialogue is a "character imitation" scenario. For example, if the user's input speech instructs the terminal to speak with Lin Zhiling's voice, then after recognizing the user intent, the terminal automatically sets the current dialogue scenario to "character imitation".
(2) In an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "character imitation", the terminal device can still determine whether the content of the input text corresponding to the user's input speech involves content suitable for character imitation. In a specific implementation, reply content suitable for character imitation can be identified by means of full-text matching, keyword matching, semantic similarity matching, and the like; such content includes lyrics, sound effects, movie lines, cartoon dialogue scripts, and so on.
In the specific embodiments of the present invention, the acoustic model library of the terminal device is preset with various acoustic models (i.e., personalized acoustic models) for implementing "character imitation". A "character imitation" acoustic model can be used to give the synthesized speech the voice characteristics of a particular person, so information such as the preset timbre, preset intonation, and preset prosodic rhythm of a "character imitation" acoustic model differs from that of the general-purpose acoustic model. The person imitated by a "character imitation" acoustic model may be a figure the user likes, a character in a film or television work, or a combination of several preset voice models and the user's preferences. For example, a "character imitation" acoustic model may be an acoustic model that imitates the user's own speaking style, or an acoustic model that imitates the speaking characteristics of another person, for example an acoustic model imitating "Lin Zhiling / gentle voice", an acoustic model imitating "Xiao Shenyang / funny voice", an acoustic model imitating "Andy Lau / deep voice", and so on. In addition, in a possible embodiment, what the terminal selects in the speech synthesis process is not a specific acoustic model in the acoustic model library but a composite of multiple acoustic models in the acoustic model library (also called a fusion model).
The ways in which the terminal obtains from the acoustic model library the acoustic model corresponding to "character imitation" may include the following:
(1) The terminal device may select an acoustic model or a fusion model from the acoustic model library according to the user's identity. Specifically, because the user's identity can be associated with the user's preferences, the terminal device may determine the user's preferences according to the user's identity, and then select an acoustic model or a fusion model from the acoustic model library according to the user's preferences, for example the user's favorite acoustic model imitating "Lin Zhiling / gentle voice", the acoustic model imitating "Xiao Shenyang / funny voice", the acoustic model imitating "Andy Lau / deep voice", a preset fusion model, and so on.
It should be noted that the acoustic model preferred by the user is not necessarily a personalized acoustic model originally provided in the acoustic model library; it may be an acoustic model obtained by fine-tuning the parameters of a personalized acoustic model according to the user's preferences. For example, the voice features of a personalized acoustic model originally provided in the acoustic model library include a first speech rate, a first intonation, a first prosodic rhythm, and a first timbre. Through analysis of the user's preferences or through the user's manual settings, the terminal determines that the user's favorite combination of parameters is 0.8 times the first speech rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm, and 1.2 times the first feminine timbre, and adjusts these parameters accordingly, thereby obtaining a personalized acoustic model that meets the user's needs.
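A minimal sketch of this fine-tuning step, assuming the preset features and the user-preferred factors are stored as simple scalars; all names and values are illustrative.

```python
# Scale a personalized model's preset features by user-preferred factors.
base_model   = {"rate": 1.0, "intonation": 1.0, "rhythm": 1.0, "timbre_femininity": 1.0}
user_factors = {"rate": 0.8, "intonation": 1.3, "rhythm": 0.9, "timbre_femininity": 1.2}

tuned_model = {feature: base_model[feature] * user_factors[feature] for feature in base_model}
print(tuned_model)  # {'rate': 0.8, 'intonation': 1.3, 'rhythm': 0.9, 'timbre_femininity': 1.2}
```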
(2) The terminal device determines, according to the content of the current input speech, a voice model identifier related to the content of the current input speech, and selects from the acoustic model library the acoustic model corresponding to the voice model identifier. For example, the terminal may determine, according to the input text, the user's preferences, or the reply text, that the currently synthesized speech needs to use a "Zhou Xingchi (Stephen Chow)"-style voice, and then selects the "Zhou Xingchi" voice-type acoustic model from the acoustic model library.
(3) After selecting multiple acoustic models from the acoustic model library according to the user's identity, the terminal device determines a weight value (i.e., a preference coefficient) of each of the multiple acoustic models, where the weight values of the acoustic models are preset by the user, or the weight values of the acoustic models are determined in advance according to the user's preferences; the acoustic models are then fused based on the weight values to obtain a fused acoustic model.
For example, after the terminal device has obtained the user's preferences or requirements for voices, it may also match them directly against the voices of multiple acoustic models according to the user's identity (i.e., the user's preferences or requirements are bound directly to the user's identity), and thereby determine that the user's preference coefficients for voice types such as deep, gentle, cute, and funny are 0.2, 0.8, and 0.5 respectively; that is, the weights of these acoustic models are 0.2, 0.8, and 0.5 respectively. The final acoustic model (i.e., the fusion model) can be obtained by weighted superposition of the speech rate, intonation, prosodic rhythm, timbre, and so on of each voice type. The speech scene parameters synthesized in this way realize voice conversion of the acoustic model in terms of speech rate, intonation, prosodic rhythm, and timbre, which helps produce mixed sound effects such as "a wittily speaking Lin Zhiling" or "rap-mode Lin Zhiling".
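A sketch of such weighted superposition, assuming each model's preset features are scalars and that the weighted sum is normalized by the total weight; the normalization and all parameter values are assumptions, not stated in the embodiment.

```python
# Fuse several acoustic models' preset parameters using preference coefficients.
models = {
    "deep":  {"rate": 0.9, "intonation": 0.8, "rhythm": 1.0, "timbre": 0.2},
    "soft":  {"rate": 1.0, "intonation": 1.2, "rhythm": 1.1, "timbre": 0.8},
    "funny": {"rate": 1.2, "intonation": 1.3, "rhythm": 1.3, "timbre": 0.5},
}
weights = {"deep": 0.2, "soft": 0.8, "funny": 0.5}   # preference coefficients from the example

total = sum(weights.values())
fused = {
    feature: sum(weights[name] * params[feature] for name, params in models.items()) / total
    for feature in next(iter(models.values()))
}
print(fused)  # weighted blend of speech rate, intonation, rhythm, and timbre
```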
Based on the first aspect, in a possible implementation, the TTS parameters further include correspondences between target characters and the user's preferred pronunciations. A customized character pronunciation table includes mapping relationships between target characters and the user's preferred pronunciations. The mapping relationship between a target character and the user's preferred pronunciation is used to make the target character involved in the speech synthesized by the acoustic model have the pronunciation preferred by the user. The mapping relationships between target characters and the user's preferred pronunciations are associated with the user's identity; that is, different mapping relationships can be organized according to the user's identity.
In the embodiments of the present invention, the customized character pronunciation table may be organized and stored by user identity: the customized character pronunciation table corresponding to an unregistered user is empty, while the customized character pronunciation table corresponding to a registered user supports operations such as addition, modification, and deletion based on that user's preferences. The objects of such setting operations may be characters, personal or place names, letters, special symbols, and so on that the terminal tends to misread or that the user likes. The customized character pronunciation table includes mapping relationships between target characters (or strings) and the user's preferred pronunciations; a target character (string) may be a character (Chinese or foreign), a word, a phrase, or a sentence, and may also be a digit or a symbol (such as a Chinese character, a foreign character, an emoticon, a punctuation mark, a special symbol, and so on).
Specifically, the terminal device may determine in advance, according to the user's historical input speech, the correspondence between a target character and the user's preferred pronunciation, associate the correspondence between the target character and the user's preferred pronunciation with the user's identity, and write it into the customized character pronunciation table.
For example, the terminal's original acoustic model generates the pronunciation "xiao3 zhu1 pei4 qi2" for "小猪佩奇" (Peppa Pig). If the user has previously tuned the terminal device by voice and requested that the pronunciation of "奇" in the phrase "小猪佩奇" be set to "ki1", the terminal device records "小猪佩奇" and "xiao3 zhu1 pei4 ki1" as a mapping relationship and writes this mapping relationship into the customized character pronunciation table associated with "xiaoming".
For another example, the terminal device may find, in the context information, the dialogue text output by the terminal in the previous round or previous several rounds of dialogue, and determine the pronunciation of each word in that dialogue text (for example, using the acoustic model). For example, the text output by the terminal in the previous round of dialogue was "很高兴认识你，小茜" ("Nice to meet you, Xiao Qian"), and the terminal determines that its corresponding pronunciation is "hen3 gao1 xing4 ren4 shi2 ni3, xiao3 xi1". By matching the misread pronunciation against the pronunciation string of that output text, the DM module can determine that the Chinese word corresponding to the misread pronunciation "xiao3 xi1" is "小茜"; that is, "小茜" is the target word (the target character to be corrected). The terminal device then adds the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target character-pronunciation pair to the customized character pronunciation table associated with the current user identity.
In this way, in the speech synthesis of the current dialogue, when the terminal device finds that the target character associated with the user's identity exists in the reply text, it performs speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information. For example, in the current real-time human-computer dialogue, when the terminal device's reply text contains "小茜", the terminal device determines, according to the record in the customized character pronunciation table, that the pronunciation of "小茜" is "xiao3 qian4". Thus, in the reply speech obtained by speech synthesis through the acoustic model, "小茜" is pronounced "xiao3 qian4".
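A minimal sketch of such a per-user customized character pronunciation table and its lookup before synthesis, reusing the pinyin readings from the examples above; the table structure and function name are assumptions.

```python
# Per-user mapping from target strings to preferred pinyin readings.
PRONUNCIATION_TABLE = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1", "小茜": "xiao3 qian4"},
}

def pronunciation_overrides(user_id: str, reply_text: str) -> dict:
    # Collect the target strings that occur in the reply text together with the
    # user's preferred readings; the synthesizer would apply these as overrides.
    table = PRONUNCIATION_TABLE.get(user_id, {})
    return {target: reading for target, reading in table.items() if target in reply_text}

print(pronunciation_overrides("xiaoming", "很高兴认识你，小茜"))  # {'小茜': 'xiao3 qian4'}
```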
Based on the first aspect, in a possible implementation, the TTS parameters further include background sound effects; that is, the TTS parameter library may contain a music library, and the music library includes multiple pieces of music information used to provide background sound effects during speech synthesis. A background sound effect specifically refers to a segment of music (such as an instrumental piece or a song) or a sound effect (such as film and television sound effects, game sound effects, language sound effects, or animation sound effects). The background sound effect is used to superimpose music or sound effects of different styles and rhythms onto the background of the speech synthesized by the acoustic model, thereby enhancing the expressive effect of the synthesized speech (for example, enhancing its emotional effect).
The following takes the scenario of superimposing a "background sound effect" onto the synthesized speech as an example to describe the speech synthesis method of the embodiments of the present invention.
In the embodiments of the present invention, a background sound effect needs to be superimposed onto the synthesized speech only when the terminal device determines that the reply text contains content suitable for superimposing background music. Specifically, the terminal device may automatically identify content suitable for superimposing background music; such content may be text with emotional polarity, poetry or song lyrics, film and television lines, and so on. For example, the terminal may use the DM module to identify emotionally inclined words in a sentence, and then determine the emotional state of a phrase or sentence in the reply text, or of the entire reply text, through methods such as grammatical rule analysis and machine learning classification. In this process, an emotion dictionary may be used to identify these emotionally inclined words. The emotion dictionary is a set of words, each of which has a clear emotional polarity tendency, and the emotion dictionary also contains the polarity information of these words; for example, the words in the dictionary are labeled with emotional polarity types such as happy, like, sadness, surprise, angry, fear, and disgust. In possible embodiments, different emotional polarity types may even be further divided into multiple levels of emotional intensity (for example, five levels).
After determining that the reply text contains content suitable for superimposing a background sound effect, the terminal determines from the music library the background sound effect to be superimposed. Specifically, the terminal sets in advance emotional polarity category labels for the different segments (i.e., sub-segments) of each music file in the music library; for example, these segments are labeled with emotional polarity types such as happy, like, sadness, surprise, angry, fear, and disgust. Assuming that the current reply text includes text with emotional polarity, after determining the emotional polarity categories of this text, the terminal device searches the music library for music files carrying the corresponding emotional polarity category labels. In a possible embodiment, if the emotional polarity types can be further divided into multiple levels of emotional intensity, emotional polarity category labels and emotional intensity labels are set in advance for each sub-segment in the music library; then, after the emotional polarity categories and emotional intensities of this text are determined, a combination of sub-segments carrying the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
The following describes the process in which the terminal device selects, according to part or all of the content of the reply text, the best-matching background sound effect from the preset music library. The terminal device may split the content of the reply text onto which a background sound effect is to be superimposed into different parts (split by punctuation or by word segmentation); each part may be called a sub-content, and the emotional polarity type and emotional intensity of each sub-content are computed. Then, after the background sound effect that best matches the content is determined in the music library, the content is aligned with the matched background sound effect, so that the emotional variation of the content is basically consistent with the emotional variation of the background sound effect. Specifically, the best-matching background sound effect includes multiple sub-segments, each carrying an emotional polarity type label and an emotional intensity label; the emotional polarity types indicated by the labels of the sub-segments are respectively the same as the emotional polarity types of the corresponding sub-contents, and the variation trend of the emotional intensities indicated by the labels of the sub-segments is consistent with the variation trend of the emotional intensities of the sub-contents.
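A toy sketch of this splitting and per-sub-content emotion classification, assuming a tiny emotion dictionary with polarity labels and intensity values; the dictionary contents and the intensity scale are assumptions.

```python
import re

# Toy emotion dictionary: word -> (polarity, intensity).
EMOTION_DICT = {"开心": ("happy", 3), "赢": ("happy", 2), "不错": ("happy", 1),
                "难过": ("sadness", 2), "害怕": ("fear", 2)}

def classify_sub_contents(reply_text: str):
    results = []
    for sub in (s for s in re.split(r"[，。！？,.!?]", reply_text) if s.strip()):
        hits = [label for word, label in EMOTION_DICT.items() if word in sub]
        polarity = hits[0][0] if hits else "neutral"
        intensity = max((h[1] for h in hits), default=0)
        results.append((sub, polarity, intensity))
    return results

print(classify_sub_contents("天气不错，国足又赢球了，好开心"))
# -> [('天气不错', 'happy', 1), ('国足又赢球了', 'happy', 2), ('好开心', 'happy', 3)]
```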
For example, in one application scenario, the reply text is "天气不错，国足又赢球了，好开心" ("The weather is nice, the national football team won again, I'm so happy"), and the entire content of the reply text needs a background sound effect. The reply text is split into three sub-contents, "天气不错," "国足又赢球了," and "好开心"; the emotional polarity category of each sub-content is happy, and they have different emotional intensities. A music file whose emotional polarity category is happy is first identified in the music library; further, the emotional variation trajectory of the music file can be computed and analyzed to obtain the emotional intensities of three sub-segments of the music. The emotional variation of these three sub-segments is basically consistent with the emotional variation trend of the three sub-contents of the reply text, so the music clip composed of these three sub-segments of this music file is the background sound effect that matches the reply text. The sub-contents "天气不错," "国足又赢球了," and "好开心" of the reply text can therefore be aligned with these three sub-segments respectively. In the subsequent speech synthesis, the terminal device performs speech synthesis on the reply text through the selected acoustic model according to the background sound effect (i.e., the best-matching music clip), the basic speech synthesis information, and the enhanced speech synthesis information, and the final reply speech that is output presents the effect of "speech overlaid with a background sound effect".
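A sketch of how the matching condition described above might be checked, assuming each text sub-content and each labelled music sub-segment is represented as a (polarity, intensity) pair, and that "consistent variation trend" means the same sign of change between neighbouring items; that interpretation is an assumption.

```python
# Check that a candidate music clip matches the text's polarity and intensity trend.
def trend(values):
    return [(b > a) - (b < a) for a, b in zip(values, values[1:])]  # +1, 0, or -1 per step

def matches(sub_contents, music_segments):
    if len(sub_contents) != len(music_segments):
        return False
    same_polarity = all(c[0] == m[0] for c, m in zip(sub_contents, music_segments))
    same_trend = trend([c[1] for c in sub_contents]) == trend([m[1] for m in music_segments])
    return same_polarity and same_trend

text_parts = [("happy", 1), ("happy", 2), ("happy", 3)]   # from the reply text
clip       = [("happy", 2), ("happy", 4), ("happy", 5)]   # labelled music sub-segments
print(matches(text_parts, clip))   # True: same polarity, same rising intensity trend
```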
Based on the first aspect, in a possible implementation, the current dialogue scenario may also be a "nursery rhyme humming" voice scene; in this scenario, the enhanced speech synthesis information used by the terminal device in speech synthesis includes the speech scene parameters for "nursery rhyme humming".
The following takes the voice scene of "song humming" (with nursery rhyme humming as an example) to describe the speech synthesis method of the embodiments of the present invention.
In music, time is divided into equal basic units, and each basic unit is called a "beat". The duration of a beat is expressed in terms of note values: one beat may be a quarter note (i.e., a quarter note counts as one beat), a half note (a half note counts as one beat), or an eighth note (an eighth note counts as one beat). The rhythm of music is generally defined by its meter, for example 4/4: in 4/4 time, a quarter note is one beat and there are four beats per bar, i.e., four quarter notes. The so-called speech scene parameters for "nursery rhyme humming" are the preset beat types of various nursery rhymes and the way in which the content of a reply text that needs to be synthesized in the "nursery rhyme humming" manner is segmented.
In the embodiments of the present invention, the terminal determines, from the reply text and the context information, that the voice scene of the current dialogue is the "nursery rhyme humming" voice scene.
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "nursery rhyme humming" voice scene.
Another way is that, in an ordinary dialogue, even though the user has no explicit intent indicating that the current dialogue is "nursery rhyme humming", the terminal can still use the DM module to determine whether the content of the reply text involves the content of a nursery rhyme. In a specific implementation, the DM module may search a locally pre-stored nursery rhyme library or a nursery rhyme library on a network server through methods such as text search matching or semantic analysis; the nursery rhyme library may contain the lyrics of a wide variety of nursery rhymes. The DM module then determines whether the content of the reply text exists in these nursery rhyme lyrics, and if so, sets the current dialogue scenario to the "nursery rhyme humming" voice scene.
In the embodiments of the present invention, the terminal device may align the content of the reply text with the beats to facilitate subsequent speech synthesis. Specifically, in a specific embodiment, the terminal may use the PM module to align the content of the reply text with the determined beats, so as to ensure that each field of the text is fused with the rhythmic pattern of the nursery rhyme. Specifically, the terminal aligns the segmented text fields with the time axis according to the pattern of the beats.
For example, if a field in the reply text has 3 characters and the matched meter is 3/3 or 3/4, the 3 characters can be aligned one-to-one with the 3 beats of a bar.
For another example, if the number of characters in a field of the reply text is smaller than the number of beats in a bar, for example the field has 2 characters and the meter is 4/4, the terminal searches the text fields adjacent to this field; if the field before it (or the field after it) also has 2 characters, this field and the adjacent field can be merged and together aligned with the 4 beats of the bar. If the adjacent fields cannot be merged, or the number of characters after merging is still smaller than the number of beats, beat alignment can further be performed in the following ways: one way is to fill the beats that have no character with silence; another way is to align the rhythm by lengthening the duration of a particular character; yet another way is to lengthen the duration of every character evenly to ensure overall time alignment.
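A minimal sketch of this beat alignment, assuming one character per beat where possible, even stretching when the beat count is a multiple of the character count, and rests otherwise; the policy details are assumptions and the merging of adjacent fields is omitted.

```python
# Align a text field to a fixed number of beats per bar (simplified policy).
def align_to_bar(field: str, beats_per_bar: int):
    chars = list(field)
    if len(chars) >= beats_per_bar:
        return chars[:beats_per_bar]                       # one character per beat
    if beats_per_bar % len(chars) == 0:
        stretch = beats_per_bar // len(chars)              # lengthen each character evenly
        return [c + "~" * (stretch - 1) for c in chars]    # "~" marks a held syllable
    return chars + [""] * (beats_per_bar - len(chars))     # fill remaining beats with rests

print(align_to_bar("小白兔", 3))   # ['小', '白', '兔']        one character per beat
print(align_to_bar("跳跳", 4))     # ['跳~', '跳~']            each character held for two beats
print(align_to_bar("真可爱", 4))   # ['真', '可', '爱', '']    last beat filled with a rest
```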
In a second aspect, an embodiment of the present invention provides a speech synthesis device, where the device includes a processor and a memory coupled to the processor, where:
the memory is configured to store an acoustic model library and a speech synthesis parameter library (which may be referred to as a TTS parameter library for short), where the acoustic model library stores one or more acoustic models, and the speech synthesis parameter library stores basic speech synthesis information associated with the user's identity, as well as enhanced speech synthesis information;
the processor is configured to: determine the user's identity according to the user's current input speech; obtain an acoustic model from the acoustic model library according to the current input speech, where preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determine basic speech synthesis information from the speech synthesis parameter library according to the user's identity, where the basic speech synthesis information includes variations of one or more of the preset speech rate, the preset volume, and the preset pitch; determine a reply text according to the current input speech; determine enhanced speech synthesis information from the speech synthesis parameter library according to the reply text and the context information of the current input speech, where the enhanced speech synthesis information includes variations of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the processor is specifically configured to: determine literary-style features of the reply text according to the reply text, where the literary-style features include one or more of the number of sentences, the number of characters per sentence, and the arrangement order of the numbers of characters per sentence for part or all of the content of the reply text; and select, according to the literary-style features involved in the reply text, a corresponding variation of the preset prosodic rhythm from the speech synthesis parameter library, where there is a correspondence between the literary-style features and the variations of the preset prosodic rhythm, and the variation of the preset prosodic rhythm represents the respective variations of the reading duration of characters, the positions of reading pauses, the durations of reading pauses, and the stress in part or all of the content of the reply text.
Based on the second aspect, in a possible embodiment, the preset information of the selected acoustic model further includes language style features, where the language style features specifically include one or more of catchphrases, ways of responding to particular scenarios, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for particular people.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and the processor is specifically configured to: determine the user's preferences according to the user's identity, and select an acoustic model from the acoustic model library according to the user's preferences.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and each acoustic model has a voice model identifier; the processor is specifically configured to: determine, according to the content of the current input speech, a voice model identifier related to the content of the current input speech, and select from the acoustic model library the acoustic model corresponding to the voice model identifier.
Based on the second aspect, in a possible embodiment, there are multiple acoustic models in the acoustic model library, and the processor is specifically configured to: select multiple of the acoustic models according to the user's identity; determine a weight value of each of the multiple acoustic models, where the weight values of the acoustic models are preset by the user, or the weight values of the acoustic models are determined in advance according to the user's preferences; and fuse the acoustic models based on the weight values to obtain a fused acoustic model.
Based on the second aspect, in a possible embodiment, the processor is further configured to: before determining the user's identity according to the user's current input speech, determine, according to the user's historical input speech, a correspondence between a target character and the user's preferred pronunciation, associate the correspondence between the target character and the user's preferred pronunciation with the user's identity, and save the correspondence between the target character and the user's preferred pronunciation into the speech synthesis parameter library; the processor is further specifically configured to: when the target character associated with the user's identity exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the speech synthesis parameter library further stores a music library; the processor is further configured to select a background sound effect from the music library according to the reply text, where the background sound effect is music or a sound effect; the processor is further specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
Based on the second aspect, in a possible embodiment, the background sound effect carries one or more emotional polarity type labels and emotional intensity labels; the emotional polarity type label is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; the emotional intensity label is used to indicate the respective degree value of the at least one emotion; the processor is specifically configured to: split the content of the reply text into multiple sub-contents and determine the emotional polarity type and emotional intensity of each sub-content; and select the best-matching background sound effect from the music library according to the emotional polarity type and emotional intensity of each sub-content, where the best-matching background sound effect includes multiple sub-segments, each carrying an emotional polarity type label and an emotional intensity label, the emotional polarity types indicated by the labels of the sub-segments are respectively the same as the emotional polarity types of the corresponding sub-contents, and the variation trend of the emotional intensities indicated by the labels of the sub-segments is consistent with the variation trend of the emotional intensities of the sub-contents.
Based on the second aspect, in a possible embodiment, the device may further include an audio circuit, where the audio circuit can provide an audio interface between the device and the user, and the audio circuit may further be connected to a loudspeaker and a microphone. On the one hand, the microphone can collect the user's sound signals and convert the collected sound signals into electrical signals, which are received by the audio circuit and converted into audio data (i.e., forming the user's input speech); the audio data is then transmitted to the processor for speech processing. On the other hand, after synthesizing the reply speech based on the user's input speech, the processor 2011 transmits it to the audio circuit; the audio circuit converts the received audio data (i.e., the reply speech) into an electrical signal and transmits it to the loudspeaker, which converts it into a sound signal for output.
In a third aspect, an embodiment of the present invention provides a speech synthesis device, where the speech synthesis device includes a speech recognition module, a speech dialogue module, and a speech synthesis module, where:
the speech recognition module is configured to receive a user's current input speech;
the speech dialogue module is configured to: determine the user's identity according to the user's current input speech; determine basic speech synthesis information according to the user's identity, where the basic speech synthesis information includes variations of one or more of a preset speech rate, a preset volume, and a preset pitch of an acoustic model; determine a reply text according to the current input speech; and determine enhanced speech synthesis information according to the reply text and context information, where the enhanced speech synthesis information includes variations of one or more of a preset timbre, a preset intonation, and a preset prosodic rhythm of the acoustic model;
the speech synthesis module is configured to: obtain the acoustic model from a preset acoustic model library according to the current input speech, where the preset information of the acoustic model includes the preset speech rate, the preset volume, the preset pitch, the preset timbre, the preset intonation, and the preset prosodic rhythm; and perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
The speech recognition module, the speech dialogue module, and the speech synthesis module are specifically configured to implement the speech synthesis method described in the first aspect.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
According to a fifth aspect, an embodiment of the present invention provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the method described in the first aspect.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, and thus automatically combine the user's preferences and the dialogue scenario to generate reply speech of different styles, providing a personalized speech synthesis effect for different users. This greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of human-machine dialogue. In addition, the terminal also allows the user to tune the terminal's voice response system in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal is closer to the user's interaction preferences, maximizing the user's interaction experience.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the background art more clearly, the accompanying drawings required in the embodiments of the present invention or the background art are described below.
FIG. 1 is a schematic diagram of the basic physical elements of speech according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of another system architecture according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system architecture and a terminal device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a TTS parameter library according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an acoustic model library according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a speech synthesis process according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of performing speech synthesis on a reply text according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of still another system architecture and a terminal device according to an embodiment of the present invention;
FIG. 10 is a schematic flowchart of a speech synthesis method according to an embodiment of the present invention;
FIG. 11 is an exemplary chart of basic TTS parameters associated with user identities according to an embodiment of the present invention;
FIG. 12 is an exemplary chart of a customized character pronunciation table according to an embodiment of the present invention;
FIG. 13 is an exemplary chart of an emotion parameter correction mapping table according to an embodiment of the present invention;
FIG. 14 is an exemplary chart of speech emotion parameters associated with user identities according to an embodiment of the present invention;
FIG. 15 is an exemplary chart of a scene parameter correction mapping table according to an embodiment of the present invention;
FIG. 16 is an exemplary chart of speech scene parameters associated with user identities according to an embodiment of the present invention;
FIG. 17 to FIG. 19 are exemplary charts of calling instructions corresponding to reply texts according to an embodiment of the present invention;
FIG. 20 is a schematic flowchart of a method for updating a customized character pronunciation table according to an embodiment of the present invention;
FIG. 21 is a schematic flowchart of a method for determining the TTS parameters required for a current reply text according to an embodiment of the present invention;
FIG. 22 is a schematic flowchart of a speech synthesis method related to a "poetry recitation" speech scene according to an embodiment of the present invention;
FIG. 23 is a schematic diagram of aligning the content of a reply text with a prosodic rhythm template according to an embodiment of the present invention;
FIG. 24 is a schematic flowchart of a speech synthesis method related to a "song humming" speech scene according to an embodiment of the present invention;
FIG. 25 is a schematic diagram of performing beat alignment on the content of a reply text according to an embodiment of the present invention;
FIG. 26 is a schematic flowchart of a speech synthesis method related to a "character imitation" scene according to an embodiment of the present invention;
FIG. 27 is an exemplary chart of sound features corresponding to the voice types of some specific acoustic models according to an embodiment of the present invention;
FIG. 28 is a schematic diagram of a selection interface for speech feature parameters and language style feature parameters according to an embodiment of the present invention;
FIG. 29 is a schematic flowchart of a speech synthesis method for a scene with superimposed background sound effects according to an embodiment of the present invention;
FIG. 30 is a schematic diagram of determining a best-matching music segment according to an embodiment of the present invention;
FIG. 31 is a schematic structural diagram of a hardware device according to an embodiment of the present invention.
DETAILED DESCRIPTION
Nowadays, with the rapid development of human-machine dialogue technology, people have higher requirements for the timeliness and personalization of human-machine dialogue. Users are no longer satisfied with a machine that merely "speaks like a human"; instead, they expect the machine to provide personalized voice interaction for different users. For example, when the user is an elderly lady with poor hearing, she may want the machine to automatically raise the speech volume; a user may want to tune the machine in the way one educates a person, so that the machine's spoken replies match the user's personality, mood, and hobbies; a user may want the machine's replies to sound more vivid and interesting, with a tone that fits the emotion of the context; or a user may want the machine's replies to fit the dialogue scene, for example the machine automatically recites poetry, sings, or tells a story according to the dialogue scene. Based on this, the embodiments of the present invention provide a speech synthesis method and corresponding devices, which are used to meet people's personalized and diversified requirements for speech synthesis in human-computer interaction.
The embodiments of the present invention are described below with reference to the accompanying drawings in the embodiments of the present invention. The terms used in the implementation part of the present invention are only used to explain specific embodiments of the present invention and are not intended to limit the present invention.
To facilitate understanding of the technical solutions of the embodiments of the present invention, the related concepts involved in the embodiments of the present invention are explained first.
Speech (speech sound), that is, the sound of language, is the sound-wave form of language as a communication tool; speech realizes the expressive function and the social function of language. The basic physical elements of speech mainly include sound intensity, sound duration, pitch, and timbre. Referring to FIG. 1, they are described as follows:
(1) Sound intensity (intensity). In different scenarios, sound intensity may also be called volume, stress, emphasis, accent, and so on. The sound intensity is determined by the amplitude of the sound wave and is proportional to it, indicating the strength of the sound. In Chinese, sound intensity plays a role in distinguishing word meanings and has certain grammatical functions; for example, sound intensity determines the distinguishing meaning of the neutral tone and of stress.
(2) Sound duration (duration). The sound duration indicates how long the sound-wave vibration lasts; it is determined by the duration of the vibration of the sound-producing body, and the longer the vibration time, the longer the sound wave. The sound duration can be characterized by the concept of speech rate (speed), which indicates how fast a sound is produced; that is, the longer the sound duration, the slower the speech rate.
(3) Pitch, sometimes also called tone height. The pitch is determined by the vibration frequency of the sound wave: the higher the vibration frequency, the higher the pitch. In Chinese, the tones of Chinese characters and the intonation of sentences are mainly determined by pitch.
(4) Timbre. In different scenarios, timbre may also be called sound quality, voice quality, and so on. The timbre represents the character and essence of a sound; different timbres correspond to different zigzag forms of the sound-wave ripple (sound waveforms). Timbre is the basic feature that distinguishes one sound from other sounds, and the timbres of different people (or sound-producing bodies) differ from one another.
Chinese differs from Western languages in its grammatical structure, grammatical rules, acoustic characteristics, and prosodic structure. In Chinese, a character corresponds to one sound, that is, one syllable is generally one Chinese character, and the tone is an indispensable part of the syllable structure. A tone is usually used to express the rise and fall of a syllable when it is pronounced, so tones are also called character tones. Besides being determined mainly by changes in pitch, the formation of tones is also reflected in changes of duration. During pronunciation, the sound-producing body can adjust the pitch and duration at any time, thereby forming different tones. Tones carry an important meaning-distinguishing function; for example, tones are used to distinguish the meanings of the Chinese words "题材" (subject matter) and "体裁" (genre), or "练习" (exercise) and "联系" (contact), and so on. In addition, in Chinese, each character has a corresponding fundamental frequency (the frequency of the fundamental tone, which determines the basic pitch of the character), and the fundamental frequencies of adjacent characters may influence one another, producing variations of the fundamental frequency (that is, tone sandhi). Moreover, in Chinese, pauses occur in the pronunciation of continuous sentences, and different characters in a sentence are pronounced lightly or stressed according to the surrounding semantics. Together, these grammatical structures, grammatical rules, acoustic characteristics, and prosodic structures form the cadence, the emotional tone, and the prosodic rhythm of spoken Chinese.
The following describes the system architecture involved in the embodiments of the present invention. The system architecture of the embodiments of the present invention involves a user and a terminal. The user inputs speech to the terminal, and the terminal can process the user's speech through a voice response system to obtain speech for replying to the user and present the reply speech to the user. The terminal in the embodiments of the present invention may be a dialogue-interaction robot, a home/commercial robot, a smart speaker, a smart desk lamp, a smart home appliance, smart furniture, or a smart vehicle, and may also be a voice assistant / voice dialogue software application running on a mobile device such as a smartphone, a laptop computer, or a tablet computer.
For example, in one application scenario, referring to FIG. 2, the terminal is a robot. The user speaks to the robot (for example, the user talks directly to the robot), and the robot replies to the user with speech (for example, the robot plays the reply speech through a buzzer), thereby realizing a human-machine dialogue between the user and the robot.
For another example, in another application scenario, referring to FIG. 3, the terminal is a voice assistant running on a smartphone. The user speaks to the voice assistant (for example, the user triggers the voice-assistant icon displayed on the smartphone and then talks), and the voice assistant replies to the user with speech (for example, the speech information is displayed on the screen and the reply speech is played through a buzzer), thereby realizing an interactive dialogue between the user and the voice assistant.
In addition, it should be noted that the terminal may also be a server. For example, in yet another application scenario, the user speaks to a smartphone, the smartphone transmits the speech information to the server, the server obtains a reply speech according to the speech information and returns the reply speech to the smartphone, and the smartphone then presents the reply speech to the user (for example, by displaying the speech information on the screen and playing the reply speech through a buzzer), thereby realizing an interactive dialogue between the user and the server.
The voice response system of the terminal in the foregoing system architecture is described in detail below.
Referring to FIG. 4, FIG. 4 shows a voice response system 10 of a terminal in a system architecture. As shown in FIG. 4, the voice response system 10 includes a speech recognition module 101, a speech dialogue module 102, and a speech synthesis module 103. The functions of the modules are described as follows:
(1) Automated speech recognition (ASR) module 101. The ASR module 101 is used to recognize the content of the speech input by the user and recognize the speech content as text, realizing the conversion from "speech" to "text".
(2) Speech dialogue module 102. The speech dialogue module 102 can be used to generate a reply text based on the recognized text input by the ASR module 101 and transmit the reply text to the speech synthesis module 103. The speech dialogue module 102 is also used to determine the personalized TTS parameters corresponding to the reply text, so that the subsequent speech synthesis module 103 can perform speech synthesis on the reply text based on the relevant TTS parameters. In a specific embodiment, the speech dialogue module 102 may specifically include the following modules:
Natural language understanding (NLU) module 1021. The NLU module 1021 can be used to perform syntactic analysis and semantic analysis on the recognized text input by the ASR module 101, so as to understand the content of the user's speech.
Natural language generation (NLG) module 1022. The NLG module 1022 can be used to generate a corresponding reply text according to the content of the user's speech and the context information.
Dialogue management (DM) module 1023. The DM module 1023 is responsible for tracking the current dialogue state and controlling the dialogue strategy.
User management (UM) module 1024. The UM module 1024 is responsible for user identity confirmation, user information management, and the like. In a specific embodiment, the UM module 1024 may use an existing identity recognition system (such as voiceprint recognition, face recognition, or even multi-modal biometric recognition) to determine the user's identity.
Intent recognition module 1025. The intent recognition module 1025 can be used to recognize the user intent indicated by the content of the user's speech. In a specific embodiment, corpus knowledge related to TTS parameter setting may be added to the intent recognition module 1025, so that the intent recognition module 1025 can recognize the user's interaction intent to set (update) one or more TTS parameters.
TTS parameter library 1026. As shown in FIG. 5, the TTS parameter library 1026 is used to store basic TTS parameters (also called basic speech synthesis information), enhanced TTS parameters (also called enhanced speech synthesis information), a customized character pronunciation table, a music library, and other information, which are described as follows:
The basic TTS parameters represent the variation amount of one or more of the preset speech rate, the preset volume, and the preset pitch of the acoustic model used in synthesizing speech. The basic TTS parameters are associated with the user's identity; that is, different basic TTS parameters can be organized according to the user's identity (in other words, according to the user's preferences).
The enhanced TTS parameters represent the variation amount of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm of the acoustic model used in synthesizing speech. In practical applications, the enhanced TTS parameters can be further classified into speech emotion parameters, speech scene parameters, and the like. The speech emotion parameters are used to make the speech synthesized by the acoustic model present specific emotional characteristics. According to the emotional characteristics, the speech emotion parameters can be further classified into parameters such as neutral emotion, mild happiness, moderate happiness, extreme happiness, mild sadness, and moderate sadness; for the specific implementation, refer to the detailed description below. The speech scene parameters are used to make the speech synthesized by the acoustic model present specific scene characteristics. According to the scene characteristics, the speech scene parameters can be further divided into parameters such as daily conversation, poetry recitation, song humming, storytelling, and news broadcasting; that is, using these speech scene parameters in speech synthesis enables the synthesized speech to present the sound effects of speech scenes such as daily conversation, poetry recitation, song humming, storytelling, and news broadcasting. For the specific implementation, refer to the detailed description below.
The customized character pronunciation table includes mapping relationships between target characters and the user's preferred pronunciations. A target character may be a character (a Chinese character or other script), a letter, a digit, a symbol, and so on. The mapping relationship between a target character and the user's preferred pronunciation is used to make the target character involved in the speech synthesized by the acoustic model carry the pronunciation preferred by the user. The mapping relationships between target characters and user-preferred pronunciations are associated with the user's identity; that is, different mapping relationships can be organized according to the user's identity. For the specific implementation, refer to the detailed description below.
The music library includes a plurality of pieces of music information, which are used to provide background sound effects during speech synthesis. A background sound effect may be a specific piece of music or a sound special effect. The background sound effect is used to superimpose music or sound effects of different styles and rhythms on the background of the speech synthesized by the acoustic model, thereby enhancing the expressive effect of the synthesized speech (for example, enhancing the emotional effect). For the specific implementation, refer to the detailed description below.
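To make the organization of such a TTS parameter library more concrete, the following Python sketch shows one possible way to hold per-user basic TTS parameters, enhanced TTS parameters, a customized character pronunciation table, and a music library in memory. The class and field names are illustrative assumptions, not names defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BasicTTSParams:
    # Variation relative to the acoustic model's presets, e.g. -0.4 means "40% lower".
    speech_rate_delta: float = 0.0
    volume_delta: float = 0.0
    pitch_delta: float = 0.0

@dataclass
class EnhancedTTSParams:
    emotion: str = "neutral"     # e.g. "neutral", "happy_low", "sad_low"
    scene: str = "daily_chat"    # e.g. "daily_chat", "poetry", "humming", "story", "news"

@dataclass
class UserTTSProfile:
    basic: BasicTTSParams = field(default_factory=BasicTTSParams)
    enhanced: EnhancedTTSParams = field(default_factory=EnhancedTTSParams)
    # Customized character pronunciation table: target string -> preferred pronunciation.
    pronunciation_table: Dict[str, str] = field(default_factory=dict)

@dataclass
class TTSParameterLibrary:
    profiles: Dict[str, UserTTSProfile] = field(default_factory=dict)  # keyed by user identity
    music_library: List[str] = field(default_factory=list)             # names of background tracks

    def profile_for(self, user_id: str) -> UserTTSProfile:
        # Unregistered users fall back to a default profile.
        return self.profiles.get(user_id, UserTTSProfile())
```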
TTS parameter management (PM) module 1027. The PM module 1027 is used to manage the TTS parameters in the TTS parameter library. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more TTS parameters according to the user's intent to set the TTS parameters. For example, in a specific embodiment, the PM module 1027 may be used to determine the basic TTS parameters associated with the user according to the user's identity, and to determine the enhanced TTS parameters used to enhance the speech synthesis effect according to the content and context information of the reply text.
(3) Text-to-speech (TTS) module 103. The TTS module 103 is used to convert the reply text generated by the speech dialogue module 102 into a reply speech, so that the reply speech can be presented to the user. The TTS module 103 may specifically include the following modules:
Instruction generation module 1031. The instruction generation module 1031 can be used to generate or update a calling instruction according to the reply text and the TTS parameters (including the basic TTS parameters and the enhanced TTS parameters) transmitted from the speech dialogue module 102; the calling instruction can be applied to the TTS engine 1032.
TTS engine 1032. The TTS engine 1032 is used to call a suitable acoustic model from the acoustic model library 1033 according to the calling instruction generated or updated by the instruction generation module 1031, and to use that acoustic model to perform speech synthesis on the reply text according to information such as the basic TTS parameters, the enhanced TTS parameters, the mapping relationships between target characters and user-preferred pronunciations, and the background sound effect, thereby generating a reply speech and returning the reply speech to the user.
Acoustic model library 1033. As shown in FIG. 6, the acoustic model library 1033 may include multiple acoustic models, for example a general acoustic model and several personalized acoustic models. These acoustic models are all neural network models, and they can be trained in advance on different corpora. Each acoustic model corresponds to its own preset information; that is, each acoustic model is bound to specific preset information, which can serve as the basic input information of that acoustic model. For example, the preset information of the general acoustic model may include two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model may include, in addition to two or more of the model's preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm, other personalized information, for example language style features such as catchphrases, ways of responding to specific scenes, type of wit, personality type, interspersed popular expressions or dialect, and forms of address for specific people. It should be understood that the preset speech rate, preset volume, preset pitch, preset timbre, preset intonation, preset prosodic rhythm, and other preset information of different acoustic models also differ; for example, the preset information of a personalized acoustic model may be clearly different from that of the general acoustic model. In the embodiments of the present invention, an acoustic model can convert the reply text into a reply speech according to its preset information and the variation information of the preset information. The variation information of the preset information here refers to the information selected for speech synthesis, such as the basic TTS parameters, the enhanced TTS parameters, the mapping relationships between target characters and user-preferred pronunciations, and the background sound effect. The speech synthesized by the general acoustic model presents the sound effect of a normal, general dialogue scene, while the speech synthesized by a personalized acoustic model can present the sound effect of a "character imitation" dialogue scene. The implementation of the "character imitation" dialogue scene is described in detail later.
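To illustrate how an acoustic model library of this kind could be organized, the sketch below binds each model to its own preset information and distinguishes the general model from personalized models. The model names, preset values, style features, and the selection rule are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AcousticModel:
    name: str
    personalized: bool
    # Preset information bound to this model (illustrative values only).
    presets: Dict[str, float] = field(default_factory=dict)
    # Extra language-style features only personalized models carry (catchphrases, dialect, ...).
    style_features: Dict[str, str] = field(default_factory=dict)

ACOUSTIC_MODEL_LIBRARY = {
    "general": AcousticModel(
        name="general", personalized=False,
        presets={"rate": 1.0, "volume": 1.0, "pitch": 1.0}),
    "imitated_character_a": AcousticModel(
        name="imitated_character_a", personalized=True,
        presets={"rate": 0.9, "volume": 1.1, "pitch": 0.8},
        style_features={"catchphrase": "...", "dialect": "..."}),
}

def select_acoustic_model(requested_character: str = "") -> AcousticModel:
    """Use a personalized model when a character imitation is requested, else the general model."""
    return ACOUSTIC_MODEL_LIBRARY.get(requested_character, ACOUSTIC_MODEL_LIBRARY["general"])
```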
It should be noted that, in possible embodiments, the modules in the embodiment of FIG. 4 may be software modules. These software modules may be stored in the memory of the terminal device, and the processor of the terminal device calls these modules in the memory to execute the speech synthesis method. In addition, in possible embodiments, the modules in the embodiment of FIG. 4 may also be implemented as hardware components in the terminal device.
The following briefly describes the speech synthesis process based on the voice response system described in FIG. 4. Referring to FIG. 7, after the voice response system obtains the user's input speech, the reply text is obtained through the speech recognition module and the speech dialogue module. The speech dialogue module determines, from the TTS parameter library, the basic TTS parameters associated with the identity of the current user; it determines the enhanced TTS parameters and the background sound effect from the TTS parameter library based on the reply text and the context information; and if the reply text contains a target character associated with the user's identity, it also determines the user-preferred pronunciation corresponding to the target character. After that, the speech synthesis module calls a suitable acoustic model from the acoustic model library based on the user's input speech, or the user's preferences (the user's preferences are associated with the user's identity), or the reply text, and performs speech synthesis through the acoustic model in combination with the TTS parameters (one or more of the basic TTS parameters, the enhanced TTS parameters, the mapping relationship between the target character and the user-preferred pronunciation, and the background sound effect), thereby generating the reply speech to be presented to the user.
To facilitate understanding of the solutions of the embodiments of the present invention, FIG. 8 is taken as an example below. FIG. 8 shows the speech synthesis process in an application scenario. As shown in FIG. 8, in this application scenario, after the voice response system obtains the user's input speech, the reply text obtained through the speech recognition module and the speech dialogue module is "今天天气很好" (the weather is very good today). The speech dialogue module determines the basic TTS parameters associated with the user's identity, determines enhanced TTS parameters such as the speech emotion parameter and the speech scene parameter based on the content and context information of the reply text, and determines the background sound effect based on the content of the reply text. The speech synthesis module can then, through the selected acoustic model, perform speech synthesis on the reply text based on the selected basic TTS parameters, speech emotion parameter, speech scene parameter, and background sound effect, and finally generate the synthesized speech used to reply to the user (jin1 tian1 tian1 qi4 hen3 hao3).
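The flow just described can be summarized in a short end-to-end sketch. In the following Python fragment, the injected components and their methods (`recognize`, `identify_user`, `generate_reply`, `pick_enhanced_params`, `pick_background`, `select`, `synthesize`) are placeholders standing in for the ASR, dialogue, and TTS modules of FIG. 4; their names and signatures are assumptions made for illustration only.

```python
def handle_utterance(audio, asr, dialogue, tts, param_lib, model_lib):
    """One turn of the voice response system: input speech in, reply speech out."""
    text = asr.recognize(audio)                    # ASR: speech -> recognized text
    user_id = dialogue.identify_user(audio)        # UM: determine who is speaking
    reply_text = dialogue.generate_reply(text)     # NLU + DM + NLG: build the reply text

    profile = param_lib.profile_for(user_id)       # basic params tied to the user's identity
    enhanced = dialogue.pick_enhanced_params(reply_text, dialogue.context)
    background = dialogue.pick_background(reply_text, param_lib.music_library)

    model = model_lib.select(audio, profile, reply_text)   # general or personalized model
    return tts.synthesize(reply_text,
                          basic=profile.basic,
                          enhanced=enhanced,
                          pronunciations=profile.pronunciation_table,
                          background=background,
                          acoustic_model=model)
```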
It should be noted that the embodiment of FIG. 4 is only one specific implementation of the present invention. Other possible implementations of the present invention may include more or fewer functional modules, and the functional modules described above may also be appropriately split, combined, or redeployed.
For example, the acoustic model library 1033 may be deployed in the TTS engine 1032, which makes it more convenient for the TTS engine to call an acoustic model and perform speech synthesis through the acoustic model.
For example, the acoustic model library 1033 may also be deployed in the speech dialogue module 102, or at a location outside the speech dialogue module 102.
For example, in a possible implementation, the PM module 1027 and the TTS parameter library 1026 may also be integrated together and deployed independently at a location outside the speech dialogue module 102.
For example, in a possible implementation, the PM module 1027 may also be deployed in the TTS engine 1032; that is, "TTS parameter management" may be implemented as a function of the TTS engine 1032. For another example, in a possible embodiment, the intent recognition module 1025 may also be deployed in the DM module 1023; that is, "intent recognition" may be implemented as a function of the DM module 1023.
For example, in possible embodiments, the TTS parameter library 1026 may be deployed in the PM module 1027, that is, the PM module 1027 may organize and store the TTS parameters by category and user identity; or the TTS parameter library 1026 may be deployed independently at a location outside the speech dialogue module 102; or the acoustic model library 1033 may be deployed independently at a location outside the TTS module 103; or the acoustic model library 1033 may also be deployed together with the TTS parameter library 1026; and so on.
For another example, in a possible implementation, as shown in FIG. 9, in order to enrich the selectability of TTS parameters in speech synthesis, the PM module 1027 may be split into a basic TTS parameter management module 1028 and an enhanced TTS parameter management module 1029. The basic TTS parameter management module 1028 is used to manage the basic TTS parameters and the customized character pronunciation table in the TTS parameter library 1026. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more basic TTS parameters according to the user's intent to set the basic TTS parameters, and performing the same operations on the customized character pronunciation table according to the user's intent to set the user-preferred pronunciation corresponding to a target character. During speech synthesis, the basic TTS parameter management module 1028 can also be used to obtain the basic TTS parameters associated with the user's identity, the user-preferred pronunciation corresponding to a target character, and so on. The enhanced TTS parameter management module 1029 is used to manage the enhanced TTS parameters and the music library in the TTS parameter library 1026. The management operations include performing query, addition, deletion, update (change), selection, and obtaining (determination) operations on one or more enhanced TTS parameters according to the user's intent to set the enhanced TTS parameters, and performing the same operations on the music library according to the user's intent to set the background sound effect. During speech synthesis, the enhanced TTS parameter management module 1029 can obtain, according to the content and context information of the reply text, the enhanced TTS parameters and the background sound effect used to enhance the speech synthesis effect.
It should be noted that, in possible embodiments, the modules in the embodiment of FIG. 9 may be software modules. These software modules may be stored in the memory of the terminal device, and the processor of the terminal device calls these modules in the memory to execute the speech synthesis method. In other possible embodiments, the modules in the embodiment of FIG. 9 may also be implemented as hardware components in the terminal device.
For another example, in a possible implementation, the enhanced TTS parameter management module 1029 may also be deployed in the TTS engine 1032; that is, "enhanced TTS parameter management" may be implemented as a function of the TTS engine 1032.
It should also be noted that, to facilitate understanding of the technical solutions of the present invention, this document mainly describes the technical solutions of the present invention based on the functional modules presented in the embodiment of FIG. 4; implementations based on other forms of functional modules can be derived by similar reference and are not described one by one herein.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, in the human-machine voice interaction between the user and the terminal, after the ASR module recognizes the user's speech as text, the speech dialogue module on the one hand generates the corresponding reply text, and on the other hand can select personalized TTS parameters based on the reply text of the dialogue interaction and the dialogue context information, in combination with the current user's identity, preferences, and the dialogue scenario. The TTS module can then generate a reply speech of a specific style according to these personalized TTS parameters, providing the user with a personalized speech synthesis effect, which greatly improves the voice interaction experience between the user and the terminal and improves the timeliness of human-machine dialogue. In addition, the terminal also allows the user to tune the terminal in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal is closer to the user's interaction preferences, maximizing the user's interaction experience.
Referring to FIG. 10, based on the foregoing system architecture and voice response system, the following describes, from a multi-sided perspective, the flow of the speech synthesis method provided by an embodiment of the present invention. The method flow includes but is not limited to the following steps:
Step 101. The user inputs speech to the terminal; correspondingly, the terminal obtains the speech input by the user.
The terminal in the embodiments of the present invention may be a dialogue-interaction robot, a home/commercial robot, a smart speaker, a smart desk lamp, a smart home appliance, smart furniture, or a smart vehicle, and may also be a voice assistant / voice dialogue software application running on a mobile device such as a smartphone, a laptop computer, or a tablet computer. For specific implementation, refer to the description of the embodiment in FIG. 2 or FIG. 3; details are not repeated here.
Step 102. The terminal recognizes the content of the speech input by the user and recognizes the speech as text.
In a specific embodiment, the terminal can recognize the content of the user's input speech through the ASR module of its voice response system, for example recognizing that the content of the user's input speech is: "You speak too slowly, please speak a bit faster", "Can you speak a bit louder?", "What is the line before '白云深处有人家'?", and so on. The ASR module can be implemented directly using a current commercial ASR system; those skilled in the art are familiar with its implementation, which is not described here.
Step 103. The terminal determines the identity of the user.
In a specific embodiment, the terminal may recognize the user's identity through the UM module of its voice response system. For example, the UM module may determine the identity of the speaker (that is, the user) by means of voiceprint recognition, face recognition, or even multi-modal biometric recognition. It can be understood that, if the terminal recognizes the user as a locally registered user (for example, the current user is xiaoming), the TTS parameters corresponding to that user can subsequently be retrieved; if the terminal cannot recognize the user's identity, it determines that the user is an unknown user (for example, the current user is xiaohua), and the default TTS parameters can subsequently be retrieved.
Step 104. The terminal determines the user's speaking intent.
In a specific embodiment, the terminal may determine the user's speaking intent by combining the NLU module and the intent recognition module of its voice response system. The implementation process is as follows. The NLU module performs text analysis on the recognized text, including word segmentation, semantic analysis, and part-of-speech analysis, and identifies the keywords/words in it. For example, keywords/words related to TTS parameter setting may include "voice", "volume", "speaking speed", "pronunciation", "emotion", "recite", "fast", "slow", "happy", "sad", and so on. The intent recognition module, in combination with the dialogue context, performs coreference resolution and sentence-meaning completion on the recognized text, and can then use a template matching method or a statistical model method to recognize whether the user has the intent to update the TTS parameters. Coreference resolution refers to determining which noun phrase a pronoun in the recognized text refers to.
For the template matching method, the keywords and word combinations appearing in common instructions can first be analyzed, and templates/rules can then be constructed to match specific intents. For example, if a sentence pattern such as "... voice/speak/talk/read ... slow/fast ..." appears in the text sentence, it can be considered that the user's speaking intent is to adjust the speech rate in the basic TTS parameters corresponding to that user (for example, increase or decrease the speech rate by 20%). If a sentence pattern such as "... voice/speak/talk/read ... loud/quiet/big/small ..." appears, it can be considered that the user's speaking intent is to adjust the volume in the basic TTS parameters corresponding to that user (for example, increase or decrease the volume by 20%). If a sentence pattern such as "the [word 1] in what was just said ... should be pronounced/read ... [word 2]" appears, it can be considered that the user's speaking intent is to correct/add a pronunciation in the customized character pronunciation table in the basic TTS parameters corresponding to that user. If a sentence pattern such as "... emotion/feeling/read/talk/speak ... happy/joyful/glad/pleasant ..." appears, it can be considered that the user's speaking intent is to set the speech emotion parameter to "mild happiness". If one or more lines of poetry appear in the text sentence, or a sentence pattern such as "... recite/read/declaim ... poem/poetry/verse ..." appears, it can be considered that the user's speaking intent is to set the speech scene parameter to "poetry recitation", and so on.
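A minimal sketch of such template matching is shown below, using regular expressions over the recognized text. The particular rule set, intent labels, and payloads are a simplified illustration; a real system would carry many more templates.

```python
import re

# Each rule: (pattern over the recognized text, intent label, parameter payload)
TEMPLATE_RULES = [
    (re.compile(r"(声音|说|讲|读).*(慢|快)"), "adjust_speech_rate", {"delta": 0.2}),
    (re.compile(r"(声音|说|讲|读).*(大声|小声|大|小)"), "adjust_volume", {"delta": 0.2}),
    (re.compile(r"(感情|情感|读|讲|说).*(高兴|欢乐|开心|愉快)"), "set_emotion", {"emotion": "happy_low"}),
    (re.compile(r"(念|读|朗诵).*(诗|诗歌|词)"), "set_scene", {"scene": "poetry"}),
]

def match_intent(recognized_text: str):
    """Return (intent, payload) for the first template the text matches, else (None, None)."""
    for pattern, intent, payload in TEMPLATE_RULES:
        if pattern.search(recognized_text):
            return intent, payload
    return None, None

# Example: "说话太慢了，请说快一点吧" matches the speech-rate template.
print(match_intent("说话太慢了，请说快一点吧"))   # -> ('adjust_speech_rate', {'delta': 0.2})
```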
For the statistical model method, common expressions corresponding to various user speaking intents can be collected in advance, and each type of expression intent is labelled with a category, forming a training set containing multiple pieces of labelled data; the labelled data of the training set are then used to train a machine learning model. Training algorithms include but are not limited to the support vector machine (SVM) algorithm, the Naive Bayes algorithm, the decision tree algorithm, and neural network (NN) algorithms. In this way, after the model is trained, when the user's speaking intent needs to be determined, the keywords/words of the text sentence corresponding to the user's speech are input into the model, and the speaking intent corresponding to that text sentence can be determined. Further, the trained models may also be classified in advance by dialogue domain or topic type, for example into "weather", "poetry", "song", "news", "daily communication", "movie", "sports", and other models. In this way, the intent recognition module can determine the dialogue domain or topic type according to the current dialogue state and the keywords/words of the text sentence, and then preferentially feed the keywords/words into the corresponding dialogue-domain model or topic-type model, thereby determining the speaking intent corresponding to the text sentence.
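As an illustration of the statistical approach, the sketch below trains a tiny Naive Bayes intent classifier with scikit-learn on a handful of labelled utterances. The training sentences, the intent labels, and the use of character n-grams (to avoid a separate word segmenter) are all assumptions made for the example; a practical system would need a much larger labelled corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A toy labelled training set: (utterance, intent category).
train_texts = [
    "说话太慢了，请说快一点吧", "你讲得太快了，慢一点",
    "声音能不能大一点", "说话小声一点",
    "请用高兴的语气说话", "读得悲伤一点",
    "给我朗诵一首诗", "念一下这首诗歌",
]
train_labels = [
    "adjust_speech_rate", "adjust_speech_rate",
    "adjust_volume", "adjust_volume",
    "set_emotion", "set_emotion",
    "set_scene_poetry", "set_scene_poetry",
]

# Character n-grams work directly on Chinese text without explicit word segmentation.
intent_model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    MultinomialNB(),
)
intent_model.fit(train_texts, train_labels)

print(intent_model.predict(["说慢一点好吗"])[0])   # expected: adjust_speech_rate
```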
Step 105. The terminal determines whether the user's speaking intent is to set TTS parameters.
Step 106. If it is determined that the speaking intent is to set TTS parameters (operations such as update, deletion, and addition), the terminal performs the TTS parameter setting operation according to the indication of the speaking intent. The TTS parameters include basic TTS parameters such as the variation amounts of the speech rate, volume, and pitch associated with the user's identity, as well as the customized character pronunciation table; the TTS parameters also include enhanced TTS parameters such as speech emotion parameters and speech scene parameters, background sound effects, and so on. It should be understood that, in possible implementations, the enhanced TTS parameters may or may not be associated with the user's identity. The setting operations are correspondingly operations such as adding TTS parameters, deleting TTS parameters, and updating (changing) TTS parameters.
In a specific embodiment, if the user is a locally registered user, an update operation may be performed on the TTS parameters associated with that user's identity. If the user is an unregistered user, a local user identity may first be created/registered for the user; the local user identity is initially associated with the default TTS parameters, and an update operation is then performed on the default TTS parameters associated with that user identity.
In a specific embodiment, the terminal may, through the PM module of the voice response system, perform the update operation on the TTS parameters associated with the user's identity in the TTS parameter library according to a TTS parameter update instruction issued by the speech dialogue module (specifically, for example, the NLU module and/or the intent recognition module).
For example, in the embodiments of the present invention, the basic TTS parameters represent variation amounts (or variation coefficients) relative to the basic physical elements of speech. The variation amounts of the preset speech rate, preset volume, and preset pitch in the basic TTS parameters can be organized and stored by user identity. Referring to FIG. 11, FIG. 11 shows an exemplary chart of basic TTS parameters associated with user identities. As shown in FIG. 11, the arrays in the chart represent the rise/fall ratios relative to the default values of the preset speech rate, preset volume, and preset pitch of the acoustic model selected for speech synthesis. The chart includes unregistered users and registered users. An unregistered user is a user who has not yet registered an identity or whose authentication has failed; the associated variation amounts of the preset speech rate, preset volume, and preset pitch are all the default value 0. Registered users are users who have registered an identity and passed authentication, for example "xiaoming", "xiaoming_mom", "xiaoming_grandma", and "xiaoming_dad". It can be seen that, for the user "xiaoming_grandma", the associated basic TTS parameters for speech rate, volume, and pitch are "-40%, +40%, +20%"; that is, when synthesizing speech for this user, the basic speech corresponding to the reply text will have its speech rate decreased by 40%, its volume increased by 40%, and its pitch increased by 20%. It can also be seen that the variation amounts of the preset speech rate, preset volume, and preset pitch corresponding to these registered users can be added, corrected/changed, or deleted. For example, based on the speaking intent of "xiaoming" to "increase the volume", the terminal raises the variation amount of the preset volume associated with "xiaoming" from the default value "0" to "+20%"; for another example, based on the speaking intent of "xiaoming_mom" to "lower the speech rate", the terminal reduces the variation amount of the preset speech rate associated with "xiaoming_mom" from the original "+40%" to "+20%", and so on.
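The chart of FIG. 11 can be mirrored directly as a small lookup structure. The sketch below holds rise/fall ratios per user identity and applies them to the presets of an acoustic model; the entries not stated in the text above and the concrete preset numbers used for the model are assumptions for illustration only.

```python
# Variation amounts (rise/fall ratios) per user identity, in the spirit of FIG. 11.
BASIC_TTS_PARAMS = {
    "unregistered":     {"rate": 0.0,  "volume": 0.0,  "pitch": 0.0},
    "xiaoming":         {"rate": 0.0,  "volume": +0.2, "pitch": 0.0},   # volume raised to +20%
    "xiaoming_mom":     {"rate": +0.2, "volume": 0.0,  "pitch": 0.0},   # rate lowered to +20%
    "xiaoming_grandma": {"rate": -0.4, "volume": +0.4, "pitch": +0.2},
}

# Assumed presets of the selected acoustic model (illustrative numbers only).
MODEL_PRESETS = {"rate": 1.0, "volume": 1.0, "pitch": 1.0}

def apply_basic_params(user_id: str) -> dict:
    """Scale the acoustic model's presets by the user's variation amounts."""
    deltas = BASIC_TTS_PARAMS.get(user_id, BASIC_TTS_PARAMS["unregistered"])
    return {key: MODEL_PRESETS[key] * (1.0 + deltas[key]) for key in MODEL_PRESETS}

# For xiaoming_grandma: 40% slower, 40% louder, 20% higher pitch.
print(apply_basic_params("xiaoming_grandma"))   # {'rate': 0.6, 'volume': 1.4, 'pitch': 1.2}
```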
For another example, in the embodiments of the present invention, the customized character pronunciation table can be organized and stored by user identity. Referring to FIG. 12, FIG. 12 shows an exemplary chart of a customized character pronunciation table associated with user identities. As shown in FIG. 12, the customized character pronunciation table corresponding to an unregistered user is empty, while entries in the customized character pronunciation table corresponding to a registered user can be added, changed, or deleted based on that user's preferences. The objects of the setting operation may be characters, person/place names, letters, special symbols, and so on that the terminal easily mispronounces or that the user prefers to pronounce in a particular way. The customized character pronunciation table includes mapping relationships between target characters (or strings) and user-preferred pronunciations. A target character (string) may be a character (Chinese or foreign), a word, a phrase, or a sentence, and may also be a digit or a symbol (such as a Chinese character, a foreign character, an emoticon, a punctuation mark, a special symbol, and so on). For example, the terminal's original preset pronunciation table pronounces "小猪佩奇" as "xiao3 zhu1 pei4 qi2"; if the speaking intent of "xiaoming" is to set the pronunciation of "奇" in the phrase "小猪佩奇" to "ki1", the terminal writes "小猪佩奇" and "xiao3 zhu1 pei4 ki1" into the customized character pronunciation table associated with "xiaoming" as a mapping relationship. It can be understood that the chart shown in FIG. 12 is merely an example and not a limitation.
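The sketch below shows one way such a per-user pronunciation table could be stored and applied when preparing the pronunciation of a phrase in a reply text. The default lookup function is a stand-in assumption for the terminal's built-in grapheme-to-phoneme step, not a real component.

```python
# Per-user customized character pronunciation table: target string -> preferred pronunciation.
PRONUNCIATION_TABLE = {
    "xiaoming": {"小猪佩奇": "xiao3 zhu1 pei4 ki1"},
    "xiaoming_mom": {},
}

def default_pronunciation(text: str) -> str:
    """Stand-in for the terminal's preset pronunciation lookup (assumed, not real)."""
    builtin = {"小猪佩奇": "xiao3 zhu1 pei4 qi2"}
    return builtin.get(text, text)

def pronounce(user_id: str, text: str) -> str:
    """Prefer the user's customized pronunciation over the terminal's default."""
    overrides = PRONUNCIATION_TABLE.get(user_id, {})
    return overrides.get(text, default_pronunciation(text))

print(pronounce("xiaoming", "小猪佩奇"))       # xiao3 zhu1 pei4 ki1 (user preference)
print(pronounce("xiaoming_mom", "小猪佩奇"))   # xiao3 zhu1 pei4 qi2 (terminal default)
```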
For another example, in this embodiment of the present invention, regarding the speech emotion parameter among the enhanced TTS parameters: the speech emotion parameter characterizes intonation changes in the speech, that is, changes in the rise and fall of pitch, in loudness, in speaking speed, and in the positions and durations of pauses. These changes play a very important role in how expressive the speech is; through changes in intonation, the speech can convey complex emotions such as happiness, delight, sadness, sorrow, distress, hesitation, ease, firmness, and boldness.
In a specific embodiment of the present invention, the TTS parameter library maintains a mapping between "the speech emotion suggested by the voice dialogue module" and "speech emotion parameters", for example the emotion parameter correction mapping table shown in FIG. 13. Speech synthesized with different speech emotion parameters carries the corresponding emotional tone. If the speech emotion suggested by the voice dialogue module is "Neutral", the speech synthesized by the speech synthesis module from the neutral emotion parameters reflects a neutral tone (that is, without any emotional coloring); if the suggested emotion is "Happy_low", the synthesized speech carries a mildly happy tone; if the suggested emotion is "Sad_low", the synthesized speech carries a mildly sad tone; and so on. It can be understood that the table shown in FIG. 13 is merely an example and not a limitation.
In a specific embodiment of the present invention, the speech emotion parameter is related not only to the user identity but also to the reply text and the context information. After a user identity is created, the default speech emotion parameter associated with that identity may correspond to the neutral emotion. During the voice dialogue, the terminal determines the speech emotion parameter used in the current speech synthesis from the user identity, the reply text, and the context information together. For example, if the terminal determines that the reply text and the context information do not specify a speech emotion, or that the specified emotion is consistent with the user's default speech emotion, the terminal applies the user's default speech emotion to the final speech synthesis; for instance, if the user's default speech emotion is "neutral" and the terminal determines that the speech synthesis of the current reply text specifies no emotion, the terminal still applies "neutral" to the synthesis of the final speech. If the terminal determines that the reply text and the context information do require a specified speech emotion, and that emotion differs from the user's default, the terminal automatically adjusts the current speech emotion to the specified one; for instance, if the user's default speech emotion is "neutral" but the terminal determines that the speech synthesis of the current reply text requires a "mildly happy" emotion, the terminal uses the "mildly happy" speech emotion parameters for the final speech synthesis.
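The selection rule just described can be summarized in a short sketch (the function and value names are illustrative; the same shape of rule is applied to the speech scene parameter later in this description):

```python
def choose_speech_emotion(user_default_emotion, suggested_emotion):
    """Decision rule described above: if the reply text / context suggests
    no emotion, or the suggestion equals the user's default, keep the
    default; otherwise switch to the suggested emotion."""
    if suggested_emotion is None or suggested_emotion == user_default_emotion:
        return user_default_emotion
    return suggested_emotion

print(choose_speech_emotion("Neutral", None))         # -> Neutral
print(choose_speech_emotion("Neutral", "Happy_low"))  # -> Happy_low
```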
In a specific embodiment, the terminal may update the speech emotion parameter associated with a user identity based on that user's speaking intention. As shown in FIG. 14, the terminal may, according to the speaking intention of "xiaoming_grandma", change the speech emotion parameters associated with "xiaoming_grandma" from the default "neutral" parameters to the "mildly happy" parameters. It can be understood that the table shown in FIG. 14 is merely an example and not a limitation.
For another example, in this embodiment of the present invention, regarding the speech scene parameter among the enhanced TTS parameters: the speech scene parameter characterizes changes in the prosodic rhythm of the speech. Compared with the prosodic rhythm of ordinary dialogue in its natural state, the changed rhythm is clearer and more pronounced and carries stronger emotional expression, so that the spoken dialogue fits a specific application scenario. The rhythm changes can be reflected in the positions and durations of pauses, the positions of stress, the durations of words or characters, the speaking rate of words or characters, and so on. These specific rhythm changes can present speech scene effects such as "poetry recitation", "song humming (or nursery-rhyme humming)", "storytelling", and "news broadcasting".
In a specific embodiment of the present invention, the TTS parameter library maintains a mapping between "the speech scene suggested by the voice dialogue module" and "speech scene parameters", for example the scene parameter correction mapping table shown in FIG. 15. It can be understood that speech synthesized with different speech scene parameters reflects the tone of the corresponding scene: speech synthesized with the speech scene parameters of daily dialogue sounds like daily conversation, speech synthesized with the parameters of poetry recitation sounds like a poem being recited, speech synthesized with the parameters of song humming sounds like a song being hummed, and so on. It can be understood that the table shown in FIG. 15 is merely an example and not a limitation; in possible embodiments, other speech scene parameters may also be designed according to the needs of actual applications, such as storytelling or news broadcasting.
In a specific embodiment of the present invention, the speech scene parameter is mainly related to the reply text and the context information. Referring to FIG. 15, after a user identity is created, the speech scene corresponding to the default speech scene parameters associated with that identity is "daily dialogue". During the voice dialogue, the terminal determines the speech scene parameters used in the current speech synthesis from the user identity, the reply text, and the context information together. For example, if the terminal determines that the reply text and the context information do not specify a speech scene, or that the specified scene is consistent with the user's default scene, the terminal applies the user's default speech scene parameters to the final speech synthesis; for instance, if the user's default speech scene is "daily dialogue" and the terminal determines that the speech synthesis of the current reply text specifies no scene, the terminal still applies "daily dialogue" to the synthesis of the final speech. If the terminal determines that the reply text and the context information require a specified speech scene that differs from the user's default, the terminal automatically adjusts the current speech scene to the specified one; for instance, if the user's default speech scene is "daily dialogue" but the terminal determines that the speech synthesis of the current reply text requires the "poetry recitation" scene, the terminal applies the speech scene parameters corresponding to "poetry recitation" to the final speech synthesis.
In a specific embodiment, the terminal may update the default speech scene parameters associated with a user identity based on that user's speaking intention. As shown in FIG. 16, the terminal may, according to the speaking intention of "xiaoming_dad", change the speech scene corresponding to the default speech scene parameters of "xiaoming_dad" from "daily dialogue" to "poetry recitation". It can be understood that the table shown in FIG. 16 is merely an example and not a limitation.
It should be noted that the speech scene parameters for "poetry recitation", "song humming (for example nursery-rhyme humming)", and similar scenes are described in detail later in this document and are not repeated here.
In addition, to better implement this step, in one possible implementation, after the intent recognition module determines a TTS parameter setting intent, the PM module performs the specific update operation. The procedure may be implemented as follows. The PM module maintains a mapping table from parameter-update intents to specific operation interfaces, so that the corresponding operation API is determined from the ID of the currently recognized intent. For example, for the intent of increasing the volume, it calls the Update-Costomized-TTS-Parameters-volume interface, whose inputs are the user ID and the adjustment amplitude; for the intent of correcting the pronunciation of a symbol, it calls the Update-Costomized-TTS-Parameters-pron interface, whose inputs are the user ID, the symbol whose pronunciation is to be corrected, and the target pronunciation string; and so on. If the current user is a registered user, the PM module executes the relevant update interface and carries out the TTS parameter update process described above. If the current user is an unregistered user, the PM module may add a new user information record for this unknown user, with all associated TTS parameters set to their default values, and then update the associated TTS parameters.
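One possible way for the PM module to keep this intent-to-interface mapping is a simple dispatch table. In the sketch below the Python handler names mirror the interfaces mentioned above, but their signatures, the intent IDs, and the slot names are assumptions made for illustration:

```python
# Sketch of the PM module's mapping from parameter-update intents to
# operation interfaces.

def update_customized_tts_parameters_volume(user_id, delta):
    print(f"volume change for {user_id} adjusted by {delta:+.0%}")

def update_customized_tts_parameters_pron(user_id, symbol, target_pron):
    print(f"pronunciation of '{symbol}' for {user_id} set to '{target_pron}'")

INTENT_DISPATCH = {
    "increase_volume": lambda uid, slots:
        update_customized_tts_parameters_volume(uid, slots["delta"]),
    "correct_pronunciation": lambda uid, slots:
        update_customized_tts_parameters_pron(uid, slots["symbol"], slots["target"]),
}

DEFAULT_RECORD = {"rate": 0.0, "volume": 0.0, "pitch": 0.0}
user_records = {"xiaoming": dict(DEFAULT_RECORD)}   # registered users

def handle_update_intent(intent_id, user_id, slots):
    """If the user is unregistered, first create a record holding default
    TTS parameters, then run the interface mapped to the intent ID."""
    if user_id not in user_records:
        user_records[user_id] = dict(DEFAULT_RECORD)
    INTENT_DISPATCH[intent_id](user_id, slots)

handle_update_intent("increase_volume", "xiaoming", {"delta": 0.2})
handle_update_intent("correct_pronunciation", "xiaohua",
                     {"symbol": "奇", "target": "ki1"})
```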
Step 107. The terminal generates a reply text in combination with the context information.
In an embodiment, if the user's speaking intention is to set TTS parameters, the terminal first sets the TTS parameters according to that intention and then generates a reply text, which mainly informs the user that the TTS parameter setting has been completed. For example, if the intention indicated by the current input speech is "increase the speaking rate" or "increase the volume", the preset text corresponding to the setting result may be returned as the reply text, such as "I am speaking a little faster now" or "The volume has been turned up a bit".
In another embodiment, if the user's speaking intention is not to set TTS parameters, the terminal may generate the reply text from the content of the user's utterance together with the context information of the dialogue. For example, if the content of the user's input speech is "What is the weather like today?", the terminal may query local or network resources, or use a dialogue model, to obtain a reply text such as "The weather is fine today; it is sunny". If the content of the input speech is "What is the line before '白云深处有人家'?", the terminal may query local or network resources, or use a dialogue model, to obtain the reply text "The line before '白云深处有人家' is '远上寒山石径斜'", and so on.
In a specific embodiment, the terminal may generate the reply text through the NLG module of the voice response system in combination with the context information held in the DM module. In specific implementations, reply text generation may be retrieval-based, model-based, or the like.
In the retrieval-based approach, a corpus of question-answer pairs is prepared in advance; when a reply is to be generated, the best match to the current question is found in the corpus and its answer is returned as the reply text.
In the model-based approach, a neural network model is trained in advance on a large corpus of question-answer pairs; during reply generation, the question is fed into the model as input, the corresponding answer is computed, and that answer is used as the reply text.
Step 108. The terminal determines the TTS parameters required for the current reply text.
In a specific embodiment, on one hand, the terminal may determine, through the PM module (or a basic TTS parameter management module) of the voice response system, the basic TTS parameters associated with the current user identity, such as the parameters corresponding to the preset pitch, preset speaking rate, and preset volume, as well as the pronunciations of target characters (or strings) in the text. On the other hand, the terminal may determine, through the PM module (or an enhanced TTS parameter management module), the corresponding enhanced TTS parameters, such as speech emotion parameters, speech scene parameters, and background sound effects, according to the content of the reply text and the context information.
In a specific embodiment of the present invention, reply text content suitable for superimposing background sound effects may be poetry or lyrics, lines from films or television, or text with emotional polarity. It should be noted that background sound effects are described in detail later and are not repeated here.
Step 109. The terminal selects an acoustic model from a preset acoustic model library according to the current input speech. This step may also be performed before step 108.
Specifically, the terminal is preset with an acoustic model library, which may include multiple acoustic models, for example a general acoustic model and several personalized acoustic models. These acoustic models are all neural network models, which may be trained in advance on different corpora. Each acoustic model has its own preset information, which serves as the basic input information of that model. For example, the preset information of the general acoustic model may include two or more of its preset speaking rate, preset volume, preset pitch, preset timbre, preset intonation, and preset prosodic rhythm. The preset information of a personalized acoustic model may include, in addition to two or more of the preset speaking rate, volume, pitch, timbre, intonation, and prosodic rhythm, other personalized information, for example language style features such as catchphrases, ways of responding to specific scenarios, type of wit, personality type, mixed-in slang or dialect, and forms of address for specific people.
In this embodiment of the present invention, the acoustic model can convert the reply text into the reply speech according to its preset information and the change information of that preset information. The change information of the preset information refers to the basic TTS parameters, the enhanced TTS parameters, the mappings between target characters and the user's preferred pronunciations, the background sound effects, and other information selected for the speech synthesis. Speech synthesized by the general acoustic model presents the sound of a normal, generic dialogue scenario, whereas speech synthesized by a personalized acoustic model can present the sound of a "character imitation" dialogue scenario. The implementation of the "character imitation" dialogue scenario is described in detail later.
In a specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the identity of the user, the acoustic model preferred by the user, and selecting that preferred acoustic model from the multiple acoustic models in the library.
In another specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the content of the current input speech, an acoustic model identifier related to the content of the user's input speech. The identifier of an acoustic model uniquely characterizes the sound of that model. For example, an acoustic model identified as "林志玲" (Lin Zhiling) is used to synthesize a Lin Zhiling-style voice, and an acoustic model identified as "小沈阳" (Xiao Shenyang) is used to synthesize a Xiao Shenyang-style voice, and so on. Thus, if the content of the input speech is related to "林志玲", the acoustic model bearing the "林志玲" identifier can be selected.
In yet another specific embodiment, the terminal selecting an acoustic model from the preset acoustic model library according to the current input speech includes: the terminal determining, according to the identity of the user, a weight value for each of the multiple acoustic models, where the weight values are preset by the user or determined in advance by learning the user's preferences; then performing a weighted superposition of the acoustic models based on these weight values to obtain a combined acoustic model (which may be called a fusion model), and selecting the fusion model.
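A minimal numerical sketch of the weighted superposition follows; it assumes, purely for illustration, that each acoustic model is represented by the acoustic feature frames it predicts for the same reply text (the real models are neural networks, and NumPy arrays stand in for their outputs):

```python
import numpy as np

def fuse_acoustic_models(model_outputs, weights):
    """Weighted superposition of per-model acoustic features (e.g. frames
    of a mel spectrogram predicted for the same reply text). The weights
    may come from explicit user settings or from learned preferences."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize to sum to 1
    stacked = np.stack(model_outputs)              # (n_models, frames, dims)
    return np.tensordot(weights, stacked, axes=1)  # weighted sum over models

# Toy example: two "models" predicting 3 frames of 4-dimensional features.
out_a = np.ones((3, 4)) * 1.0
out_b = np.ones((3, 4)) * 3.0
fused = fuse_acoustic_models([out_a, out_b], weights=[0.25, 0.75])
print(fused[0])   # -> [2.5 2.5 2.5 2.5]
```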
Step 110. The terminal generates a corresponding call instruction according to the reply text and the determined TTS parameters.
In a specific embodiment, the terminal may generate, through the instruction generation module of the voice response system, the call instruction required by the TTS engine according to the reply text, the determined TTS parameters, and other information.
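The exact format of the call instruction is defined by the TTS engine; before the concrete examples in FIG. 17 to FIG. 19, the sketch below shows, purely as an illustration, how such an instruction could be assembled as a structured object (all field names are assumptions):

```python
import json

def build_tts_call_instruction(reply_text, basic_params, enhanced_params,
                               pron_overrides, acoustic_model_id):
    """Bundle everything the TTS engine needs for one synthesis call."""
    return {
        "text": reply_text,
        "acoustic_model": acoustic_model_id,
        "basic": basic_params,             # rate / volume / pitch deltas
        "enhanced": enhanced_params,       # emotion, scene, background sound
        "pronunciations": pron_overrides,  # target string -> pinyin
    }

instruction = build_tts_call_instruction(
    reply_text="The volume has been turned up a bit",
    basic_params={"rate": 0.0, "volume": 0.2, "pitch": 0.0},
    enhanced_params={"emotion": "Neutral", "scene": "daily_dialogue"},
    pron_overrides={},
    acoustic_model_id="general",
)
print(json.dumps(instruction, ensure_ascii=False, indent=2))
```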
For example, referring to FIG. 17, in one application scenario, when the content of the input speech of the user "xiaoming" is "What is the line before '白云深处有人家'?", the reply text generated by the terminal is: the line before "白云深处有人家" is "远上寒山石径斜". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 17 and are not repeated here.
For another example, referring to FIG. 18, in another application scenario, when the input speech of the user "xiaoming" is "Could you speak a little louder?", the generated reply text is "The volume has been turned up a bit". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 18 and are not repeated here.
For another example, referring to FIG. 19, in yet another application scenario, when the input speech of the user "xiaoming_mom" is "You are speaking too slowly; please speak a little faster", the generated reply text is "I am speaking a little faster now". The TTS parameters determined by the terminal, and the call instruction generated from the reply text and those parameters, are shown by way of example in the table of FIG. 19 and are not repeated here.
Step 111. The terminal performs the speech synthesis operation based on the call instruction. Specifically, through the acoustic model, the terminal synthesizes the reply text into the reply speech according to the preset information of the acoustic model, the basic speech synthesis information, and the enhanced speech synthesis information.
In a specific embodiment, the terminal may call, through the TTS engine of the voice response system, the acoustic model determined in step S109 to perform the speech synthesis operation, so that the reply text is synthesized into the reply speech on the basis of the preset information of the acoustic model and the related TTS parameters. The TTS engine may be a system built on a statistical parametric synthesis method, which can take the various TTS parameters fully into account to synthesize speech of different styles.
Step 112. The terminal returns the reply speech to the user.
In a specific application scenario, the terminal may play the reply speech to the user through a loudspeaker. In possible embodiments, the terminal may further display the reply text corresponding to the reply speech on a display screen.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users based on the reply text of the dialogue interaction and the dialogue context information, thereby automatically combining the user's preferences with the dialogue situation to generate reply speech of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, and improving the timeliness of human-machine dialogue. In addition, the terminal allows the user to tune its voice response system in real time by voice and to update the TTS parameters associated with the user's identity and preferences, so that the tuned terminal better matches the user's interaction preferences and the user's interaction experience is maximized.
To better understand the scheme for updating TTS parameters in the embodiments of the present invention, the following takes updating the customized character pronunciation table as an example and describes in detail the process of correcting the pronunciation of a user-specified target character (for example a character with multiple readings) based on steps S104-S106 of the embodiment in FIG. 10. Referring to FIG. 20, the process includes but is not limited to the following steps:
Step S201. This step is a refinement of step S104 of the embodiment in FIG. 10. In this step, the terminal recognizes that the user's speaking intention is to correct the pronunciation of a target character, for example to correct the reading of one or more characters with multiple pronunciations.
In a specific implementation, suppose the user says "说错了，应该读作xiao3 qian4，而不是xiao3 xi1" ("You said it wrong; it should be read as xiao3 qian4, not xiao3 xi1"). After the terminal performs text analysis on the recognized text through the NLU module, it identifies the keywords "说错了" ("said it wrong") and "应该读" ("should be read"). The intent recognition module then uses these keywords to match the preset sentence template "…念/读/叫/说错了…应该念/读/叫/说…而不是…" ("… was pronounced/read/called/said wrong … should be pronounced/read/called/said … not …"), and thereby determines that the current user's speaking intention is "correct the pronunciation of the target character" (that is, the TTS parameters need to be updated).
Step S202. This step corresponds to step S105 of the embodiment in FIG. 10, that is, the terminal determines whether the user's speaking intention is to update TTS parameters.
Steps S203-S205. These steps correspond to step S106 of the embodiment in FIG. 10, that is, the terminal performs the TTS parameter update operation indicated by the speaking intention. Steps S203-S205 are described in detail as follows:
Step S203. The terminal extracts the misread pronunciation and the target pronunciation.
In a specific implementation, the intent recognition module of the terminal may, based on the matched preset sentence template, mark "xiao3 xi1" as the misread pronunciation and "xiao3 qian4" as the target pronunciation.
Step S204. The terminal determines the target word (that is, the target character to be corrected) according to the misread pronunciation and the context information.
In a specific implementation, the DM module of the terminal may find, in the context information, the dialogue text that the terminal output in the previous round or previous rounds of dialogue, and determine the pronunciation of each word in that text (for example using the acoustic model). For example, the text output by the terminal in the previous round was "很高兴认识你，小茜" ("Nice to meet you, 小茜"), and the terminal determines that its pronunciation is "hen3 gao1 xing4 ren4 shi2 ni3, xiao3 xi1". The DM module then matches the misread pronunciation against the pronunciation string of this output text, and can thereby determine that the Chinese word corresponding to the misread pronunciation "xiao3 xi1" is "小茜"; that is, "小茜" is the target word (the target character to be corrected).
Step S205. The terminal adds the target word and the target pronunciation to the customized character pronunciation table associated with the user's identity.
In a specific embodiment, the terminal adds, through the PM module, the target word "小茜" and the target pronunciation "xiao3 qian4" as a new target-character-pronunciation pair to the customized character pronunciation table associated with the current user identity. It can be understood that, in subsequent human-machine dialogues, whenever the terminal's reply text contains "小茜", the PM module will determine from the records of the customized character pronunciation table that the pronunciation of "小茜" is "xiao3 qian4".
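Putting steps S203-S205 together, a compact sketch of the correction flow might look as follows. The pinyin lookup is reduced to a tiny table that stands in for the acoustic/G2P step, and the alignment is a whole-syllable match; a real system would align at a finer level:

```python
def pinyin_of(text):
    """Stand-in grapheme-to-phoneme step for the previous reply text."""
    table = {"很": "hen3", "高": "gao1", "兴": "xing4", "认": "ren4",
             "识": "shi2", "你": "ni3", "小": "xiao3", "茜": "xi1"}
    return [(ch, table[ch]) for ch in text if ch in table]

def find_target_word(prev_reply, misread_pinyin):
    """Step S204: locate the substring of the previous reply whose default
    pronunciation matches the misread pinyin reported by the user."""
    pairs = pinyin_of(prev_reply)
    syllables = misread_pinyin.split()
    for i in range(len(pairs) - len(syllables) + 1):
        if [p for _, p in pairs[i:i + len(syllables)]] == syllables:
            return "".join(ch for ch, _ in pairs[i:i + len(syllables)])
    return None

custom_pron_table = {}                           # per-user table (step S205)
target = find_target_word("很高兴认识你，小茜", "xiao3 xi1")
custom_pron_table[target] = "xiao3 qian4"
print(target, "->", custom_pron_table[target])   # 小茜 -> xiao3 qian4
```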
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal allows the user to tune its voice response system in real time by voice during a dialogue, correcting the pronunciation of a user-specified target character (for example a character with multiple readings) according to the user's intention and thereby updating the TTS parameters associated with the user's identity and preferences, so that the tuned terminal better matches the user's interaction preferences and the user's interaction experience is maximized.
To better understand the scheme in the embodiments of the present invention for adaptively selecting TTS parameters according to the user or the current dialogue context, the specific implementation of step S108 in the foregoing embodiment of FIG. 10 is described in detail below. Referring to FIG. 21, the process may include the following steps:
Step 301. This step is a refinement of step S103 in the foregoing embodiment of FIG. 10. In this step, the terminal determines whether the identity of the current user is registered (or whether identity verification has passed).
Step 302. If the terminal determines that the identity of the current user is registered, it reads the basic TTS parameters associated with that user.
As shown in FIG. 11, for example, if the current user is "xiaoming_grandma", the basic TTS parameters associated with "xiaoming_grandma" can be found in the TTS parameter library: the change coefficient of the preset speaking rate is -40%, the change coefficient of the preset volume is +40%, and the change coefficient of the preset pitch is +20%.
Step 303. If the terminal determines that the identity of the current user is not registered (or has not passed identity authentication), it obtains the default basic TTS parameters.
For example, the current user is "xiaohua". Since the identity of "xiaohua" has not been registered, it does not exist in the TTS parameter library, so the default values for unregistered users (as shown in FIG. 11, the change coefficients of the preset speaking rate, preset volume, and preset pitch are all 0) are returned as the basic TTS parameters of the current user.
Step 304. The terminal compares the reply text with the customized character pronunciation table associated with the current user and determines whether any characters, words, or symbols in the text match entries in that table; if so, the terminal obtains the target pronunciations of those characters, words, or symbols.
For example, as shown in FIG. 12, if the current user is "xiaoming" and the current reply text contains "小猪佩奇", then, since this string exists in the customized character pronunciation table associated with "xiaoming", the pronunciation of these four characters is marked as the corresponding pronunciation in the table: xiao3 zhu1 pei4 ki1.
Step 305. The terminal obtains, according to the reply text, the speech emotion parameters among the corresponding enhanced TTS parameters from the TTS parameter library.
In a specific embodiment, the DM module may be preset with an emotion recommendation model trained on a large amount of dialogue text carrying emotion labels. The DM module feeds the reply text into the emotion recommendation model and can thus determine the emotion category of the current reply text (for example happy or sad) and its emotion degree (for example mildly happy or moderately happy). The PM module then determines the speech emotion parameters from the emotion parameter correction mapping table of the TTS parameter library according to the emotion recommended by the DM module. For example, if the current reply text is "那太好了" ("That's great") and the emotion recommended by the emotion recommendation model for this reply text is "moderately happy", the PM module obtains the speech emotion parameters corresponding to "moderately happy" from the emotion parameter correction mapping table shown in FIG. 13.
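This two-stage lookup (emotion recommendation, then parameter retrieval) can be sketched as follows. The keyword-based recommender below is only a placeholder for the trained emotion recommendation model, and the parameter values are invented for illustration rather than taken from FIG. 13:

```python
def recommend_emotion(reply_text):
    """Placeholder for the trained emotion recommendation model: map the
    reply text to an (emotion category, emotion degree) pair."""
    if "太好了" in reply_text:
        return ("happy", "medium")
    return ("neutral", None)

# Stand-in for the emotion parameter correction mapping table (FIG. 13).
EMOTION_PARAM_TABLE = {
    ("neutral", None):   {"pitch_scale": 1.00, "rate_scale": 1.00},
    ("happy", "low"):    {"pitch_scale": 1.05, "rate_scale": 1.02},
    ("happy", "medium"): {"pitch_scale": 1.10, "rate_scale": 1.05},
}

def speech_emotion_params(reply_text):
    return EMOTION_PARAM_TABLE[recommend_emotion(reply_text)]

print(speech_emotion_params("那太好了"))   # parameters for "moderately happy"
```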
Step 306. The terminal obtains, according to the reply text and the context information, the speech scene parameters among the corresponding enhanced TTS parameters from the TTS parameter library.
In a specific embodiment, the DM module may determine the scene of the current dialogue according to the context information of the current dialogue and the reply text. The PM module may then obtain the speech scene parameters among the corresponding enhanced speech parameters according to the determined dialogue scene. For example, if the current reply text is a specific line of seven-character verse (for example "门泊东吴万里船") and the DM module determines from the dialogue context and the reply text that the current dialogue is an ancient-poetry chain game, the DM module can set the speech scene to "poetry recitation", and the PM module obtains the speech scene parameters corresponding to "poetry recitation" from the scene parameter correction mapping table shown in FIG. 15. For another example, if it is determined from the dialogue context and the reply text that the current scene is a nursery-rhyme scene, the speech scene is set to "song humming", and the PM module obtains the speech scene parameters corresponding to "song humming" from the table shown in FIG. 15. For yet another example, if it is determined from the dialogue context and the reply text that the current scene is a role-playing scene, the speech scene is set to "character imitation", and the PM module obtains the speech scene parameters corresponding to "character imitation" from the table shown in FIG. 15, and so on.
It can be seen that, by implementing the technical solutions of the embodiments of the present invention, the terminal can select different TTS parameters for different users (such as basic TTS parameters, user-preferred pronunciations of target characters, speech emotion parameters, and speech scene parameters) based on the reply text of the dialogue interaction and the dialogue context information, thereby automatically combining the user's preferences with the dialogue situation to generate reply speech of different styles, providing personalized speech synthesis effects to different users, greatly improving the voice interaction experience between the user and the terminal, improving the timeliness of human-machine dialogue, and enhancing the user's interaction experience.
To better understand the technical solutions of the embodiments of the present invention, the speech synthesis method of the embodiments is described below using the "poetry recitation" speech scene as an example. Referring to FIG. 22, the method can be described in the following steps:
Step 401. The terminal is preset with "poetry recitation" speech scene parameters.
In a specific embodiment, the TTS parameter library of the terminal is preset with the speech scene parameters of "poetry recitation". The "poetry recitation" speech scene emphasizes the prosodic rhythm of the speech. Its speech scene parameters are used to adjust, for input text conforming to a specific syntactic format, the pause positions and pause durations (that is, the segmentation of the text content), the reading durations of characters or words, and the stress positions, so as to strengthen the prosodic rhythm. Compared with the natural rhythm of ordinary dialogue, the strengthened rhythm carries a clearer and stronger emotional expression; for example, when reciting poems, parallel lines of nursery rhymes, or other text in specific syntactic formats, the strengthened rhythm produces a cadenced, rising-and-falling effect.
In a specific implementation, the "poetry recitation" speech scene parameters may be realized through prosodic rhythm templates. The text content of each specific literary form (or syntactic format) may correspond to one or more prosodic rhythm templates. Each template defines the volume change (that is, how heavily the character is stressed) and the duration change (that is, how long the character is held) of the character at each position in the template, as well as the pause positions and pause durations of the speech (that is, the segmentation of the text content). A prosodic rhythm template can be produced in the following two ways:
One way is to use existing grammatical rules, or grammar and rules established by convention, to obtain the prosodic rhythm template associated with a syntactic format. For example, for the rhythm of a five-character quatrain line (such as "白日依山尽"), the segmentation may follow either a "2 characters-3 characters" or a "2 characters-2 characters-1 character" pattern; the corresponding per-character reading durations may be "short long - short short long" and "short short - short short - long" respectively, and the corresponding per-character stress may be "light heavy - light light heavy" and "light light - light light - heavy" respectively.
The other way is to train and learn from corpora in which voice talent reads with the special prosodic rhythm, and to obtain, within frameworks such as statistics, machine learning, and deep networks, a model covering pause positions, per-character or per-word reading durations, and stress positions. After the model is trained, the text content to which the "poetry recitation" mode is to be applied is input to the model, and the prosodic rhythm template corresponding to that text content is obtained.
Step 402. The terminal determines, from the reply text and the context information, that the speech scene of the current dialogue is the "poetry recitation" speech scene.
In a specific embodiment, the terminal may determine through the DM module that the speech scene of the current dialogue is "poetry recitation". Specifically, the DM module may determine this in the following ways:
In one way, during the dialogue the user intention contained in the user's input speech explicitly indicates that the current dialogue is a "poetry recitation" speech scene. After the DM module, together with the intent recognition module, determines the user intention, it sets the current dialogue to the "poetry recitation" speech scene. For example, if the user's input speech instructs the terminal to recite Tang poetry or to play an ancient-poetry chain game, the terminal automatically sets the current dialogue scene to "poetry recitation" once the user intention is recognized.
In another way, in an ordinary conversation the user has no explicit intention indicating that the current dialogue is "poetry recitation", but the terminal can still judge through the DM module whether the content of the reply text involves one or more specific literary forms such as shi (poems), ci (lyric verse), qu (songs), or fu (rhapsodies), for example five-character or seven-character quatrains or regulated verse, or specific ci or qu tune patterns. In a specific implementation, the DM module may search a locally prestored library or a library on a network server by text search matching, semantic analysis, or similar methods; the library may contain literary materials of all kinds and their corresponding literary forms. The DM module then judges whether the content of the reply text exists in the library; if it does, the current dialogue scene is set to the "poetry recitation" speech scene.
A further way is to prestore the literary-form features of various literary forms (or syntactic formats), such as the number of characters, the number of sentences, and the sequence of per-sentence character counts. The DM module can analyze features of the reply text such as punctuation (pauses), character counts, the number of sentences, and the order of per-sentence character counts, and match a passage of the reply text, or the whole text, against the prestored literary-form features. If the match succeeds, the passage or the whole text that conforms to the prestored features is used as the text of the "poetry recitation" speech scene. For example, the literary-form features of a five-character quatrain are: 4 sentences of 5 characters each, 20 characters in total. Those of a five-character regulated verse are: 8 sentences of 5 characters each, 40 characters in total. Those of a seven-character quatrain are: 4 sentences of 7 characters each, 28 characters in total. As a further example, the literary-form features of the Song ci tune 《如梦令》 are: 7 sentences whose character counts are 6, 6, 5, 6, 2, 2, and 6 respectively. If a passage of the reply text is "窗外群山如黛，教室百无聊赖。台上的老师，讲课语速澎湃。真快，真快，直叫骏马难逮。", the DM module can determine that its literary-form features conform to those of 《如梦令》 and set the current dialogue scene to the "poetry recitation" speech scene.
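The third way, matching the reply text against prestored literary-form features, reduces to comparing the sequence of per-sentence character counts. The following is a minimal sketch with only a few forms listed (the feature store in a real system would be much larger):

```python
import re

# Per-sentence character-count patterns for a few literary forms.
LITERARY_FORMS = {
    "five-character quatrain": [5, 5, 5, 5],
    "seven-character quatrain": [7, 7, 7, 7],
    "ci tune 如梦令": [6, 6, 5, 6, 2, 2, 6],
}

def sentence_lengths(text):
    """Split on Chinese punctuation and count characters per sentence."""
    parts = [p for p in re.split("[，。！？、；]", text) if p]
    return [len(p) for p in parts]

def match_literary_form(text):
    lengths = sentence_lengths(text)
    for name, pattern in LITERARY_FORMS.items():
        if lengths == pattern:
            return name
    return None

sample = "窗外群山如黛，教室百无聊赖。台上的老师，讲课语速澎湃。真快，真快，直叫骏马难逮。"
print(match_literary_form(sample))   # -> ci tune 如梦令
```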
Step 403. The terminal determines the speech scene parameters corresponding to the current "poetry recitation" speech scene.
In a specific embodiment, the terminal determines the speech scene parameters corresponding to the current "poetry recitation" speech scene through the PM module.
In one possible implementation, literary forms (or literary-form features) are associated with prosodic rhythm templates. Once the literary form (or literary-form features) involved in the current reply text has been determined, the PM module can obtain the associated prosodic rhythm template from the TTS parameter library. The template contains the corresponding speech scene parameters (that is, the prosodic rhythm change information); specifically, these parameters include the volume change and duration change of the character at each position in the template, and the pause positions and pause durations of the speech in the text. For example, for the prosodic rhythm template of a five-character quatrain, the speech scene parameters of the template include the specific segmentation method, the reading duration of each character in each sentence, and the stress of each character.
In another possible implementation, the selection of speech scene parameters may also be closely tied to the speech emotion parameters; that is, different emotion categories (such as happy or sad) and different emotion levels (such as mildly happy or moderately happy) may affect the speech scene parameters, in other words affect the concrete parameters of the prosodic rhythm template corresponding to the literary form (or literary-form features). The benefit of this design is that the speech scene is brought closer to the current speech emotion, which helps make the final speech output more vivid and natural.
For example, for one prosodic rhythm template of the five-character quatrain, the standard parameters include: a "2 characters-3 characters" segmentation, per-character reading durations of "short long - short short long", and per-character stress of "light heavy - light light heavy". Under different speech emotion parameters, the final speech rendering of this template will differ, and the differences can lie in the breaks between groups, the pitch, the stress, and so on. Table 1 below shows, for one prosodic rhythm template of the five-character quatrain, the influence of different speech emotions on that template. The speech emotions 1, 2, and 3 listed in Table 1 may denote emotion categories (such as happy, neutral, and sad) or emotion levels (such as mildly happy, moderately happy, and extremely happy). Therefore, for the determined prosodic rhythm template, the PM module can determine the final speech scene parameters from rules like those shown in Table 1 according to the speech emotion parameters of the reply text.
Table 1

Parameter                                            | Speech emotion 1               | Speech emotion 2               | Speech emotion 3
Pause between the 2-character and 3-character groups | 1.1 x standard pause duration  | 1.2 x standard pause duration  | 1.3 x standard pause duration
Degree of stress emphasis                            | 1.05 x volume                  | 1.10 x volume                  | 1.15 x volume
Range of pitch variation                             | 1.2 x pitch standard deviation | 1.4 x pitch standard deviation | 1.6 x pitch standard deviation
It should be noted that, in combining speech emotions with prosodic rhythm templates, the present invention is not limited to the implementation shown in Table 1. In other possible implementations, a deep-learning approach may be used: a support vector machine (SVM) or a deep neural network is trained on a large number of prosodic rhythm templates corresponding to different speech emotions, yielding a trained model. In practical applications, the terminal can then feed the standard prosodic rhythm template corresponding to the reply text, together with the speech emotion parameters of the reply text, into this trained model to obtain the final speech scene parameters.
Step 404. The terminal aligns the content of the reply text with the prosodic rhythm template, to facilitate the subsequent speech synthesis.
In a specific embodiment, when speech synthesis is to be performed, the terminal may align the relevant content of the reply text with the prosodic rhythm template of the "poetry recitation" speech scene. Specifically, the terminal may combine the pronunciations of the relevant content of the reply text in the acoustic model library with the parameters of the prosodic rhythm template, superimposing the template parameters onto these pronunciation segments according to a certain scale.
For example, in an exemplary embodiment, the prosody enhancement parameter is ρ (0 < ρ < 1) and the preset volume of the i-th character in the text content is Vi. If the prosodic rhythm features of this character include stress, with a stress change amount of E1, the final volume of the character is Vi × (1 + E1) × (1 + ρ). As another example, if the base duration of the i-th character in the text is Di and the duration change amount is E2, the final duration of the character is Di × (1 + E2). As yet another example, if a pause is required between the i-th character and the (i+1)-th character, the pause duration changes from 0 s to 0.02 s.
As a further example, referring to FIG. 23, the reply text contains the text "白日依山尽", the first line of a five-character quatrain. If the reply text were synthesized with the general acoustic model alone, the synthesized speech (which may be called the base pronunciation segment) would be "bai2 ri4 yi1 shan1 jin4", with the base duration of every character equal to 0.1 s and the default gap between the base pronunciations of the characters equal to 0. In this embodiment of the present invention, however, the prosodic rhythm template corresponding to the five-character quatrain is adopted when the TTS parameters are selected, so that in the subsequent synthesis of the reply text with the general acoustic model this template is additionally superimposed on the base pronunciation segment. In the finally synthesized speech, as shown in FIG. 23, with respect to reading duration, the durations of several characters in the segment are lengthened to different degrees (for example the duration of "ri4" becomes 0.17 s, that of "shan1" becomes 0.14 s, and that of "jin4" becomes 0.17 s); with respect to segmentation, a pause of 0.02 s appears between "bai2 ri4" and "yi1 shan1 jin4"; and with respect to stress, both "ri4" and "jin4" are emphasized. In other words, after the content of the reply text is aligned with the prosodic rhythm template in this embodiment of the present invention, the speech subsequently synthesized by the TTS module can present the effect of the "poetry recitation" speech scene.
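The alignment step can be illustrated with a short sketch that applies the formulas above to the base pronunciation segment of "白日依山尽". The template values follow the example of FIG. 23; the data structures, ρ value, and per-character change amounts are assumptions chosen so that the numbers work out as described:

```python
# Base pronunciation segment from the general acoustic model:
# (syllable, base duration in seconds, base volume), default gap 0 s.
base_segment = [("bai2", 0.10, 1.0), ("ri4", 0.10, 1.0), ("yi1", 0.10, 1.0),
                ("shan1", 0.10, 1.0), ("jin4", 0.10, 1.0)]

# Prosodic rhythm template for this five-character line: per-character
# duration change E2, stress change E1 (0 = unstressed), and a pause after
# index 1 to realize the "2 characters-3 characters" segmentation.
template = {
    "duration_change": [0.0, 0.7, 0.0, 0.4, 0.7],   # e.g. 0.10 s -> 0.17 s
    "stress_change":   [0.0, 0.2, 0.0, 0.0, 0.2],
    "pause_after":     {1: 0.02},                    # 0.02 s after "ri4"
}

def align_with_template(segment, tpl, rho=0.1):
    """final duration = Di*(1+E2); final volume = Vi*(1+E1)*(1+rho) for
    stressed syllables (E1 > 0), Vi otherwise; insert template pauses."""
    out = []
    for i, (syl, dur, vol) in enumerate(segment):
        e1 = tpl["stress_change"][i]
        e2 = tpl["duration_change"][i]
        new_vol = vol * (1 + e1) * (1 + rho) if e1 > 0 else vol
        out.append((syl, round(dur * (1 + e2), 3), round(new_vol, 3)))
        if i in tpl["pause_after"]:
            out.append(("<pause>", tpl["pause_after"][i], 0.0))
    return out

for item in align_with_template(base_segment, template):
    print(item)
```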
The following describes the speech synthesis method of an embodiment of the present invention by taking the "song humming" speech scene (with nursery-rhyme humming as the example). Referring to FIG. 24, the method may be described by the following steps:
Step 501: The terminal presets the speech scene parameters of "nursery-rhyme humming".
In a specific implementation, the TTS parameter library of the terminal is preset with the speech scene parameters of "nursery-rhyme humming". In music, time is divided into equal basic units, each called a "beat". The value of a beat is expressed by a note value: one beat may be a quarter note (a quarter note per beat), a half note (a half note per beat) or an eighth note (an eighth note per beat). The rhythm of music is generally defined by its meter, for example 4/4 time: in 4/4 time a quarter note is one beat and each measure has four beats, that is, four quarter notes. Presetting the "nursery-rhyme humming" speech scene parameters means presetting the meter types of various nursery rhymes, as well as the manner in which the reply text content to be synthesized in the "nursery-rhyme humming" style is segmented.
In a specific embodiment, for the "nursery-rhyme humming" speech scene, the meter of the nursery rhyme may be determined according to the number of characters between two punctuation marks, or the number of characters in each field after word segmentation. For example, for the nursery-rhyme style reply text "小燕子，穿花衣，年年春天来这里，要问燕子你为啥来，燕子说，这里的春天最美丽" ("Little swallow in your coat of flowers, you come here every spring; ask the swallow why it comes, and the swallow says spring here is the most beautiful"), the reply text may be segmented in the following two ways to determine the best matching meter:
One way is to split the reply text by punctuation marks, that is, the punctuation marks in the reply text are identified, and the numbers of characters in the fields separated by the punctuation marks are 3, 3, 7, 8, 3 and 8 respectively. Fields of 3 characters appear most often, so the meter best matching the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time.
The other way is to split the reply text according to the word-segmentation result, for example "小/燕子/穿/花衣/年年/春天/来/这里/要/问/燕子/你/为啥/来/燕子/说/这里/的/春天/最/美丽". To preserve semantic coherence, the segmentation result may be adjusted by attaching the verbs, adjectives and adverbs that modify a noun to the modified noun and merging them into one word. After this processing, the segmentation becomes "小燕子/穿花衣/年年/春天/来这里/要/问燕子/你为啥/来/燕子说/这里的/春天/最美丽", and the numbers of characters in the resulting fields are 3, 3, 2, 2, 3, 1, 3, 3, 1, 3, 3, 2 and 3. Fields of 3 characters again appear most often, so the meter best matching the reply text can be determined to be a multiple of 3, such as 3/3 time or 3/4 time.
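By way of illustration, the punctuation-based variant of this meter selection may be sketched as follows; the punctuation set, the candidate meters and the function name best_matching_meter are illustrative assumptions.

```python
import re
from collections import Counter

def best_matching_meter(reply_text, meters=(3, 4)):
    """Pick the beats-per-measure that best matches the reply text.

    Fields are obtained by splitting on punctuation, the most frequent field
    length is taken, and a candidate meter dividing that length is preferred.
    """
    fields = [f for f in re.split(r"[，。、；,.;!？?]", reply_text) if f]
    lengths = [len(f) for f in fields]
    dominant_len, _ = Counter(lengths).most_common(1)[0]
    for beats in meters:
        if dominant_len % beats == 0:
            return beats
    return meters[0]
```

For the nursery-rhyme text above, the dominant field length is 3, so the function returns 3, consistent with choosing a 3/3 or 3/4 meter.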
Step 502: The terminal determines, from the reply text and the context information, that the speech scene of the current dialogue is the "nursery-rhyme humming" speech scene.
In a specific embodiment, the terminal may determine through the DM module that the speech scene of the current dialogue is the "nursery-rhyme humming" speech scene. Specifically, the DM module may determine this in the following ways:
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "nursery-rhyme humming" speech scene. After the DM module, in combination with the intent recognition module, determines the user intent, it determines that the current dialogue is the "nursery-rhyme humming" speech scene. For example, if the user's input speech instructs the terminal to sing a nursery rhyme, the terminal recognizes the user intent and automatically sets the current dialogue scene to the "nursery-rhyme humming" speech scene.
Another way is that, in an ordinary dialogue, although the user has no explicit intent indicating "nursery-rhyme humming", the terminal may still judge through the DM module whether the content of the reply text involves a nursery rhyme. In a specific implementation, the DM module may search a locally stored nursery-rhyme library, or a nursery-rhyme library on a network server, by text-search matching, semantic analysis or the like; the library may contain the lyrics of various nursery rhymes. The DM module then judges whether the content of the reply text exists in these lyrics, and if so, sets the current dialogue scene to the "nursery-rhyme humming" speech scene.
Step 503: The terminal determines the speech scene parameters corresponding to the current "nursery-rhyme mode".
In a specific embodiment, the terminal determines the speech scene parameters corresponding to the current "nursery-rhyme mode" through the PM module. Specifically, the PM module may determine the text segmentation manner according to the content of the reply text (see the two manners described under step 501 above), segment the reply text in that manner to obtain a segmentation result, and then determine the best matching meter according to the segmentation result.
Step 504: The terminal performs beat alignment on the content of the reply text to facilitate subsequent speech synthesis.
In a specific embodiment, the terminal may align the content of the reply text with the determined meter through the PM module, so that each field of the text blends with the rhythmic pattern of the nursery rhyme. Specifically, the terminal aligns the segmented text fields with the time axis according to the changing pattern of the beats.
For example, if a field in the reply text has 3 characters and the matching meter is 3/3 or 3/4 time, the 3 characters may be aligned with the 3 beats within one measure respectively.
As another example, if the number of characters in a field of the reply text is smaller than the number of beats in a measure, for instance a 2-character field with a 4/4 meter, the text fields adjacent to that field are searched. If the field before it (or after it) also has 2 characters, the two fields may be merged and jointly aligned with the 4 beats of the measure. If the neighbouring fields cannot be merged, or the merged number of characters is still smaller than the number of beats, beat alignment may further be performed in the following ways.
One way is to fill the part where there are fewer characters than beats with silence. Specifically, if the number of characters matched to one measure of music is smaller than the number of beats, it is only necessary to ensure during matching that each character corresponds to the temporal position of one beat, and the remaining part is filled with silence. As shown in (a) of FIG. 25, for the field "小白兔" ("little white rabbit") in the reply text, the matching meter is 4/4, so "小", "白" and "兔" may be aligned with the 1st, 2nd and 3rd beats of the measure respectively, and silence is used to fill the 4th beat. It should be noted that the figure only shows one implementation; in practice the silence may fall on any of the 1st to 4th beats.
Another way is to align with the rhythm by lengthening the duration of a certain character. Specifically, when the number of characters matched to one measure is smaller than the number of beats, the characters can be aligned with the beats by lengthening the pronunciation time of one or several characters. As shown in (b) of FIG. 25, for the field "小白兔" the matching meter is 4/4, so "小" and "白" may be aligned with the 1st and 2nd beats of the measure, and the pronunciation of "兔" is lengthened so that "兔" covers the 3rd and 4th beats. It should be noted that the figure only shows one implementation; in practice the character whose pronunciation is lengthened may be any character in "小白兔".
Yet another way is to lengthen the duration of each character evenly to ensure overall time alignment. Specifically, the pronunciation time of every character in the text field may be extended evenly so that the pronunciation time of the characters is aligned with the beats of the music. As shown in (c) of FIG. 25, for the field "小白兔" the matching meter is 4/4, so the reading time of each character may be lengthened to 4/3 of a beat, which ensures that the whole field is aligned with the measure.
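By way of illustration, the three alignment manners above may be sketched as follows; the function signature, the strategy names and the "&lt;sil&gt;" marker for silence are illustrative assumptions.

```python
def align_field_to_measure(chars, beats, beat_len, strategy="pad_silence"):
    """Align a text field with the beats of one measure.

    chars    : characters of the field, e.g. ["小", "白", "兔"]
    beats    : beats per measure, e.g. 4 for 4/4 time
    beat_len : duration of one beat in seconds
    Returns a list of (unit, duration) pairs; "<sil>" marks filled silence.
    """
    n = len(chars)
    if n >= beats:
        return [(c, beat_len) for c in chars[:beats]]
    if strategy == "pad_silence":          # manner (a): fill missing beats with silence
        return [(c, beat_len) for c in chars] + [("<sil>", beat_len)] * (beats - n)
    if strategy == "stretch_last":         # manner (b): lengthen one character
        timeline = [(c, beat_len) for c in chars[:-1]]
        timeline.append((chars[-1], beat_len * (beats - n + 1)))
        return timeline
    # manner (c): lengthen every character evenly across the measure
    return [(c, beat_len * beats / n) for c in chars]
```

For the field "小白兔" with a 4/4 meter and a beat length of 0.5 s, for instance, the "stretch_last" strategy lengthens "兔" to cover the 3rd and 4th beats, corresponding to (b) of FIG. 25.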
The following describes the speech synthesis method of an embodiment of the present invention by taking an acoustic model used to implement "character imitation" as an example. Referring to FIG. 26, the method may be described by the following steps:
Step 601: The acoustic model library of the terminal is preset with acoustic models for implementing "character imitation".
In a specific embodiment, the acoustic model library of the terminal is preset with various acoustic models (i.e. personalized acoustic models) for implementing "character imitation". A "character imitation" acoustic model can make the synthesized speech carry the voice characteristics of a particular person, so information such as its preset timbre, preset intonation and preset prosodic rhythm differs from that of the general acoustic model. The person imitated by a "character imitation" acoustic model may be a figure the user personally likes, a character in a film or television work, or a combination of several preset voice models and the user's preferences. For example, such an acoustic model may imitate the user's own speaking style, or imitate the speaking characteristics of other people, such as an acoustic model imitating "Lin Zhiling / soft and graceful voice", one imitating "Xiao Shenyang / comic voice", or one imitating "Andy Lau / deep and rich voice", and so on. In addition, in a possible embodiment, what the terminal selects during speech synthesis is not a specific acoustic model in the acoustic model library but a composite model of multiple acoustic models in the library.
In the acoustic model library, besides acoustic models preset with the voice characteristics of particular people, different speech features and different language style features may also be combined according to the user's preferences or needs to form an acoustic model with individual characteristics. The speech features include speaking rate, intonation, prosodic rhythm, timbre and so on. Timbre arises because, besides a fundamental tone, a voice naturally carries many different interwoven frequencies and overtones; this determines different timbres and allows a listener to tell voices apart. The people represented by these different voices may be natural persons (such as the user or a voice model), or animated or virtual characters (such as Doraemon or Luo Tianyi). The language style features include catchphrases (including habitual modal particles), response habits in specific scenarios, intelligence type, personality type, popular expressions or dialect mixed into speech, and forms of address for particular people. That is, for an acoustic model formed by combining different speech features and different language style features according to the user's preferences or needs, the preset information includes not only two or more of the preset speaking rate, preset volume, preset pitch, preset timbre, preset intonation and preset prosodic rhythm, but also the language style features.
These language style features are described in detail below:
A user's catchphrases are sentences the user habitually says, intentionally or not. For example, when surprised, some people prepend a sentence with "有没有搞错啊?" ("Are you kidding me?"), and some people habitually insert uncertain words such as "maybe" or "perhaps" in the middle of sentences. Catchphrases may also include habitual modal particles, such as the comedian Xiao Shenyang's signature particle "嚎", which often appears at the end of his sentences.
The response to a specific scenario refers to the reply a person most commonly gives in a certain scenario or to a certain question. For example, to the question "Where shall we eat?", a given person's scenario-specific reply might be "Whatever"; to the question "What beer would you like?", it might be "Tsingtao", and so on.
The intelligence type is used to distinguish how different groups of people tend to understand different ways of presenting content. Intelligence types further include the following: the linguistic type, who read well, like textual descriptions, enjoy word games and are good at writing poems or stories; the logical-mathematical type, who are rational, good at computation and sensitive to numbers; the musical type, who are sensitive to melody and sound, like music and learn more efficiently with music in the background; the spatial type, who are sensitive to their surroundings, like reading charts and are good at drawing; the kinesthetic type, who are good at using their bodies and like sports and making things by hand; the interpersonal type, who are good at understanding and communicating with others; the introspective type, who like thinking independently and setting their own goals; and the naturalist type, who are interested in the natural creatures of the planet. People of different intelligence types answer the same question differently. For example, to the question "How many stars are there in the sky?", a logical-mathematical person might answer "About 6,974 stars are visible to the naked eye", a linguistic person might answer with the verse "七八个星天外，两三点雨山前" ("Seven or eight stars beyond the sky, two or three drops of rain before the hills"), and a musical person might answer with a song, "天上的星星数不清，最亮的是你" ("The stars in the sky are countless, and the brightest is you", from the song 《双子星》), and so on.
The personality type refers to the different language styles corresponding to people with different personalities. For example, a steady person tends to have a rigorous language style; a lively person, a humorous and witty one; an introverted person, a tactful and implicit one; and so on.
Mixing dialect into speech means that a person likes to mix a regional dialect or a foreign language into their speech, for example saying the Cantonese "唔该" or the English "Thank you" when expressing thanks. Mixing popular expressions into speech means that a person likes to replace particular words with currently popular words or internet slang, for example saying "蓝瘦香菇" instead of "难受" ("feeling awful") when sad.
A form of address for a particular person means using a particular appellation for a specific person, for example the user calling a particular person Wang Xiaoming "Teacher Wang" or "Lao Wang", and so on.
In a specific embodiment of the present invention, the speech response system of the terminal may obtain, through learning, the speech features and language style features associated with the user's identity. In a specific implementation, the user's preferences may be acquired and analysed in advance by feature transfer, that is, the user's needs may be determined from the user's consumption of information in other dimensions, so as to further infer and judge the speech features and language style features the user may like.
For example, the features of the user's favourite songs may be analysed and aggregated: the speaking rate and the strength of the prosodic rhythm of the synthesized speech are determined from the rhythmic strength of the songs; the timbre of the synthesized speech is determined from the voice characteristics of the corresponding singers; and the language style of the synthesized speech is determined from the style of the lyrics. As another example, features of dimensions such as the user's favourite television programmes and social media content may be analysed and aggregated to train a feature transfer model, which is then applied to infer the speech features and language style features the user may like.
In a specific embodiment of the present invention, the speech response system of the terminal may also acquire and analyse the user's preferences through multimodal information, that is, automatically analyse and infer the user's preferences or needs regarding the features of synthesized speech from statistics on the user's facial expressions, attention and operating behaviour. Through multimodal analysis, the user's requirements for synthesized speech can be collected not only before personalized synthesized speech is generated; after it is generated, the user's degree of preference for that speech can also be tracked continuously, and the features of the synthesized speech can be iteratively optimized based on this information.
For example, the user's preference for different voices can be obtained indirectly by analysing the user's emotions upon hearing different synthesized voices; or by analysing the user's attention when hearing different synthesized voices (attention can be obtained from the user's facial expression information, or from EEG or bioelectric signals collected by the user's wearable device); or from the user's operating habits when hearing different synthesized voices (skipping a voice or fast-forwarding through it may indicate that the user does not like it much).
The following separately describes acoustic models with the voice characteristics of particular people, and the composite model (also called the fusion model) obtained by fusing multiple acoustic models.
(1) Regarding acoustic models with the voice characteristics of particular people: compared with ordinary people, the characters (for example Lin Zhiling) or dubbing voices (for example Zhou Xingchi's dubbing) in films, television dramas, cartoons, online videos and other works are more expressive, vivid and entertaining. Moreover, classic lines in many works convey direct and strong emotion. Drawing on people's recognition of the emotions expressed by these characters, dubbing voices or lines, acoustic models with the voice characteristics of specific people can be set up so that the pronunciation characteristics of the synthesized speech match the voice characteristics of those characters, dubbing voices or lines, thereby effectively enhancing the expressiveness and entertainment value of the synthesized speech.
(2) Regarding the composite model obtained by fusing multiple acoustic models: since the acoustic model library contains multiple acoustic models, the user's preferences or needs regarding speech can be obtained in advance, and several of those models can then be fused. For example, an acoustic model imitating "Lin Zhiling / soft and graceful voice" may be fused with one imitating "Xiao Shenyang / comic voice"; or the user's own speech features and language style features, or those of a figure the user likes, may be fused with the voice models corresponding to characters in film and television works (such as the "Lin Zhiling / soft and graceful voice" acoustic model or the "Xiao Shenyang / comic voice" acoustic model), so as to obtain the final acoustic model for subsequent speech synthesis.
A specific model fusion manner is described below. In this manner, the voices of multiple personalized acoustic models in the acoustic model library can be used to realize deep-and-rich, soft-and-graceful, cute, comic and other voice types respectively. After acquiring the user's preferences or needs regarding speech (these preferences or needs are directly associated with the user's identity), the terminal determines the user's preference coefficient for each of the several acoustic models; these preference coefficients represent the weight values of the corresponding acoustic models. The weight value of each acoustic model is either manually preset by the user according to the user's own needs, or automatically determined in advance by the terminal by learning the user's preferences. The terminal may then weight and superimpose the acoustic models based on the weight values, so as to obtain a composite acoustic model by fusion.
Specifically, after acquiring the user's preferences or needs regarding speech, the terminal may, according to the speech features and language style features the user likes, select the features of the one or several dimensions for which the user's preference or need is highest, match them against the voices of the multiple acoustic models so as to determine the user's preference coefficient for the voice of each acoustic model, and finally fuse the voice features of the acoustic models with the corresponding preference coefficients to obtain the final speech scene parameters.
For example, as shown in FIG. 27, the table in FIG. 27 gives, by way of example, the voice features corresponding to various voice types (deep-and-rich, soft-and-graceful, comic); it can be seen that different voice types differ in speaking rate, intonation, prosodic rhythm and timbre. After the terminal has acquired the user's preferences or needs regarding speech, it may also match directly against the voices of the multiple acoustic models according to the user's identity (i.e. the user's preferences or needs are directly bound to the user's identity), and thereby determine that the user's preference coefficients for the deep-and-rich, soft-and-graceful and comic voice types are, for example, 0.2, 0.8 and 0.5 respectively, that is, the weights of these acoustic models are 0.2, 0.8 and 0.5. The speaking rate, intonation, prosodic rhythm, timbre and so on of each voice type are then weighted and superimposed to obtain the final acoustic model (i.e. the fusion model). The speech scene parameters synthesized in this way realize a voice conversion of the acoustic models in speaking rate, intonation, prosodic rhythm and timbre, which helps produce mixed voice effects such as "a wittily speaking Lin Zhiling" or "rap-mode Lin Zhiling".
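By way of illustration, the weighted superposition of the voice features may be sketched as follows; the numeric feature values and the normalization of the weights are illustrative assumptions, since the embodiment only requires that the features be weighted and superimposed.

```python
def fuse_acoustic_models(models, weights):
    """Weighted superposition of acoustic model voice-feature parameters.

    models  : list of dicts of numeric features, e.g.
              {"rate": 1.0, "pitch": 1.2, "rhythm": 0.9, "timbre": 0.8}
    weights : preference coefficients for the corresponding models
    Returns the fused feature dict as a normalized weighted average.
    """
    total = sum(weights)
    return {key: sum(m[key] * w for m, w in zip(models, weights)) / total
            for key in models[0]}

# e.g. weights 0.2, 0.8 and 0.5 for the deep, soft and comic voice types
fused = fuse_acoustic_models(
    [{"rate": 0.9, "pitch": 0.8, "rhythm": 1.0, "timbre": 0.3},
     {"rate": 1.0, "pitch": 1.2, "rhythm": 0.9, "timbre": 0.8},
     {"rate": 1.2, "pitch": 1.1, "rhythm": 1.3, "timbre": 0.6}],
    [0.2, 0.8, 0.5])
```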
The embodiments of the present invention are not limited to obtaining the composite model of multiple acoustic models (the fusion model for short) in the above manner. For example, in a possible embodiment, the final acoustic model may also be formed on the basis of character imitation data that the user actively enters into the TTS parameter library, or of a voice request the user sends to the terminal. For example, in one application scenario, the terminal may provide a graphical user interface or a voice interaction interface through which the user selects the parameters of each speech feature and each language style feature according to personal preference. FIG. 28 shows such a selection interface for speech feature parameters and language style feature parameters. On this selection page the user selects, as the speech features, those corresponding to the acoustic model of the "Lin Zhiling" voice, that is, the parameter values of the sub-parameters "speaking rate, intonation, prosodic rhythm, timbre" of the "Lin Zhiling" acoustic model are taken as the parameter values of the corresponding sub-parameters of the fusion model. The user selects, as the language style features, those corresponding to the acoustic model of the "Xiao Shenyang" voice, that is, the parameter values of the sub-parameters "catchphrases, responses to specific scenarios, intelligence type, personality type, mixed-in dialect/popular expressions" of the "Xiao Shenyang" acoustic model are taken as the parameter values of the corresponding sub-parameters of the fusion model.
For example, the user may send the terminal a text or voice request in advance, "Please speak with Lin Zhiling's voice in Xiao Shenyang's language style". The speech response system of the terminal then parses the user's setting intention as follows: the speaking rate, intonation, prosodic rhythm and timbre among the speech features of the fusion model are set to the corresponding sub-parameter values of the speech features of the "Lin Zhiling" acoustic model, and the catchphrases, responses to specific scenarios, intelligence type, personality type and mixed-in dialect/popular expressions among the language style features of the fusion model are set to the corresponding sub-parameter values of the language style features of the "Xiao Shenyang" acoustic model.
In addition, in a possible embodiment of the present invention, the terminal may also determine the acoustic model preferred by the user according to the user's identity, so that during speech synthesis the terminal can directly select the user's preferred acoustic model from the multiple acoustic models in the acoustic model library.
It should be noted that the acoustic model preferred by the user is not necessarily a personalized acoustic model originally set up in the acoustic model library; it may be an acoustic model obtained by fine-tuning the parameters of a personalized acoustic model according to the user's preferences. For example, the voice features of a personalized acoustic model originally set up in the library include a first speaking rate, a first intonation, a first prosodic rhythm and a first timbre. Through analysis of the user's preferences, or through the user's manual settings, the terminal determines the combination of parameters the user likes best, for example 0.8 times the first speaking rate, 1.3 times the first intonation, 0.9 times the first prosodic rhythm and 1.2 times the first feminized timbre, and adjusts these parameters accordingly to obtain a personalized acoustic model meeting the user's needs.
Step 602: The terminal determines, from the user's input speech, that the current dialogue needs to use a "character imitation" acoustic model.
In a specific embodiment, the terminal may determine through the DM module that the current dialogue needs to be set to a "character imitation" scene. Specifically, the DM module may determine this in the following ways:
One way is that, during the dialogue, the user intent contained in the user's input speech explicitly indicates that the current dialogue is a "character imitation" scene. After the DM module, in combination with the intent recognition module, determines the user intent, it determines that the current dialogue is a "character imitation" scene. For example, if the user's input speech instructs the terminal to speak with Lin Zhiling's voice, the terminal recognizes the user intent and automatically sets the current dialogue scene to the "character imitation" scene.
Another way is that, in an ordinary dialogue, although the user has no explicit intent indicating "character imitation", the terminal may still judge through the DM module whether the content of the input text corresponding to the user's input speech involves content that can be imitated. In a specific implementation, the DM module may determine reply content suitable for character imitation by full-text matching, keyword matching, semantic similarity matching and the like; such content includes lyrics, sound effects, film lines and cartoon dialogue scripts. Full-text matching means that the input text is identical to a part of the corresponding film or music work; keyword matching means that the input text shares some keywords with the film or music work; semantic similarity matching means that the input text is semantically similar to a part of the film or music work.
For example, the input text is "他已经当过主角了，他讲到白日梦不是错，没有梦想的人才是咸鱼。在为梦想拼搏的这条路上，我努力过了就会有收获那就够了。" ("He has already played the lead; he said daydreaming is not wrong, and a person without dreams is just a salted fish. On the road of fighting for a dream, as long as I have tried hard and gained something, that is enough."). After content matching in the above manners, it is found that "没有梦想的人才是咸鱼" ("a person without dreams is just a salted fish") in the input text is matchable content; the matched content is the line from the film Shaolin Soccer, "做人要是没有理想，和咸鱼有什么区别" ("If a man has no ideals, what is the difference between him and a salted fish?"), spoken in the dubbing voice of the character "Zhou Xingchi". The current dialogue is then set to the "character imitation" scene.
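By way of illustration, this content matching may be sketched as follows; the layout of the line library, the use of a simple string-similarity ratio as a stand-in for keyword and semantic matching, and the 0.5 threshold are all illustrative assumptions.

```python
import difflib

def match_imitation_content(input_text, line_library):
    """Decide whether the input text involves content that can be imitated.

    line_library: list of dicts such as
      {"line": "做人要是没有理想，和咸鱼有什么区别",
       "work": "少林足球", "voice": "周星驰"}
    Full-text containment is tried first; otherwise a fuzzy similarity score
    approximates keyword / semantic matching.
    """
    best = None
    for entry in line_library:
        if entry["line"] in input_text:              # full-text matching
            return entry, 1.0
        score = difflib.SequenceMatcher(None, input_text, entry["line"]).ratio()
        if best is None or score > best[1]:          # crude similarity matching
            best = (entry, score)
    return best if best and best[1] > 0.5 else (None, 0.0)
```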
Step 603: The terminal obtains, from the acoustic model library, the acoustic model corresponding to "character imitation".
In a specific embodiment of the present invention, the terminal may select an acoustic model, or a fusion model, from the acoustic model library according to the user's preferences.
In another specific embodiment of the present invention, the terminal determines, according to the content of the current input speech, a voice model identifier related to that content, and selects from the acoustic model library the acoustic model corresponding to that identifier. For example, if the terminal determines from the input text, the user's preferences or the reply text that the currently synthesized speech needs a "Zhou Xingchi" type of voice, it selects the acoustic model of the "Zhou Xingchi" voice type from the acoustic model library.
In yet another specific embodiment of the present invention, after selecting multiple acoustic models from the acoustic model library according to the user's identity, the terminal determines the weight value (i.e. the preference coefficient) of each of the multiple acoustic models, where the weight value of each acoustic model is preset by the user, or determined in advance according to the user's preferences; the terminal then fuses the acoustic models based on the weight values to obtain a fused acoustic model.
Step 604: The terminal performs subsequent speech synthesis with the selected acoustic model.
For example, if the general acoustic model were used for speech synthesis, then when the user's input speech is "Where shall we eat tonight?", the terminal's originally intended synthesized speech might be "We will eat at XX tonight". In the "character imitation" scene, however, the terminal uses the fusion model of the selected "Lin Zhiling" acoustic model and "Xiao Shenyang" acoustic model, and the finally synthesized speech is "你知道嘛？今晚在XX地方吃饭，嚎" ("You know what? We're eating at XX tonight, hao"). The speech features of the output speech use the relevant parameters of the "Lin Zhiling" acoustic model, reflecting the soft and graceful character of the synthesized speech, while the language style features use the relevant parameters of the "Xiao Shenyang" acoustic model, reflecting its witty and comic character. In other words, the synthesized speech output in this way achieves the effect of "speaking with Lin Zhiling's voice in Xiao Shenyang's language style".
It should be noted that the "poetry recitation", "song humming" and "character imitation" scenes enumerated in the above embodiments of the present invention may be used alone in the speech synthesis process, or used in combination. For example, for a combination of the "poetry recitation" speech scene and the "character imitation" speech scene, suppose the input text is "Read a five-character quatrain with Lin Zhiling's voice in Xiao Shenyang's language style". The terminal selects the fusion model of the "Lin Zhiling" acoustic model and the "Xiao Shenyang" acoustic model from the acoustic model library, adopts the "poetry recitation" speech scene parameters in the TTS parameter library (i.e. the prosodic rhythm template corresponding to five-character quatrains), and after speech synthesis of the reply text finally outputs the speech "那我给你念一首诗呗，《登鹳雀楼》，你知道嘛？白日依山尽，黄河入海流，欲穷千里目，更上一层楼，嚎~" ("Then let me read you a poem, 'Climbing Stork Tower', you know? The white sun sets behind the mountains, the Yellow River flows into the sea; to see a thousand miles further, climb one more storey, hao~"). That is, during synthesis this output speech may use the "character imitation" fusion model shown in FIG. 28, while the part "白日依山尽，黄河入海流，欲穷千里目，更上一层楼" additionally uses a prosodic rhythm template similar to that shown in FIG. 23, thereby both completing real-time voice interaction with the user and meeting the user's personalized needs, improving user experience.
In a specific embodiment of the present invention, after the speech is synthesized, a background sound effect may also be superimposed when the synthesized speech is output, so as to enhance the expressive effect of the various TTS parameters. The following describes the speech synthesis method of an embodiment of the present invention by taking the scenario of superimposing a "background sound effect" on the synthesized speech as an example. Referring to FIG. 29, the method may be described by the following steps:
Step 701: The terminal presets a music library.
In a specific embodiment, a music library is preset in the TTS parameter library of the terminal. The music library includes multiple music files used to provide background sound effects during speech synthesis. A background sound effect specifically refers to a music segment (such as instrumental music or a song) or a sound effect (such as a film/television, game, language or animation sound effect).
Step 702: The terminal determines that the reply text contains content suitable for superimposing background music.
In a specific embodiment, the terminal may determine through the DM module the content suitable for superimposing background music. Such content may be text with emotional polarity, poetry or lyrics, film or television lines, and so on. For example, the terminal may identify emotionally inclined words in a sentence through the DM module, and then determine the emotional state of a phrase, a sentence or the whole reply text through grammatical rule analysis, machine-learning classification and the like. In this process an emotion dictionary may be used to identify the emotionally inclined words. The emotion dictionary is a collection of words, each of which has a clear emotional polarity, and the dictionary also contains the polarity information of those words; for example, the words in the dictionary are labelled with emotional polarity types such as happy, like, sadness, surprise, angry, fear and disgust. In a possible embodiment, the different emotional polarity types may even be further divided into multiple degrees of emotional intensity (for example five grades of intensity).
Step 703: The terminal determines, from the music library, the background sound effect to be superimposed.
In a specific embodiment, the terminal determines, through the PM module, the background sound effect to be superimposed from the TTS parameter library.
For example, the terminal labels the different segments (i.e. sub-segments) of each music file in the music library with emotional polarity categories in advance, for example with types such as happy, like, sadness, surprise, angry, fear and disgust. Assuming the current reply text includes text with emotional polarity, then after the emotional polarity categories of that text are determined in step 702, the terminal searches the music library through the PM module for a music file bearing the corresponding emotional polarity category label. In a possible embodiment, if the emotional polarity types are further divided into multiple degrees of emotional intensity, each sub-segment in the music library is labelled in advance with both an emotional polarity category and an emotional intensity; after the emotional polarity categories and emotional intensities of the text are determined in step 702, a combination of sub-segments bearing the corresponding emotional polarity category and emotional intensity labels is found in the music library as the finally selected background sound effect.
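By way of illustration, a simplified selection by label may be sketched as follows; the library layout and the nearest-intensity rule are illustrative assumptions, and the embodiment also contemplates combining several labelled sub-segments rather than picking a single clip.

```python
def pick_background_music(text_tags, music_library):
    """Pick a background sound effect whose labels match the reply text.

    text_tags     : (polarity, intensity) determined for the reply text,
                    e.g. ("happy", 0.6)
    music_library : list of dicts such as
                    {"file": "clip01.wav", "polarity": "happy", "intensity": 0.63}
    Filters by polarity, then takes the clip whose labelled intensity is
    closest to the text's intensity.
    """
    polarity, intensity = text_tags
    candidates = [m for m in music_library if m["polarity"] == polarity]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(m["intensity"] - intensity))
```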
For example, if the current reply text includes the content of a poem or verse (shi/ci/qu), the terminal searches the music library through the PM module for instrumental music, songs or music effects related to that content; if such music can be found, the related instrumental music or song is used as the background sound effect to be superimposed. In addition, if an emotional polarity category label has been set in advance for each background sound effect in the music library, then after the emotional polarity category of the poem or verse content in the reply text is determined, a background sound effect bearing the corresponding emotional polarity category label may be found in the music library. In a possible embodiment, if the emotional polarity types are further divided into multiple degrees of emotional intensity, each background sound effect in the music library is labelled in advance with an emotional polarity category and an emotional intensity, and after the emotional polarity category and emotional intensity of the poem or verse content are determined, a background sound effect bearing the corresponding labels is found in the music library.
For another example, if the current reply text includes "character imitation" content, the terminal may search the music library through the PM module for instrumental music, songs or music effects related to the imitated voice model. For example, if the imitated person is the voice model "Xiao Shenyang", songs related to "Xiao Shenyang" (such as the song 《我叫小沈阳》, "My Name Is Xiao Shenyang") may be found in the music library; further, a particular segment of that song may be selected as the final background sound effect according to the dialogue scene or the content of the reply text.
Step 704: The terminal aligns the reply text with the determined background sound effect to facilitate subsequent speech synthesis.
In a specific embodiment, the terminal may split the content of the reply text on which the background sound effect is to be superimposed into different parts (splitting by punctuation or by word segmentation), each part being called a piece of sub-content, and calculate the emotional polarity type and emotional intensity of each piece of sub-content. Then, after the background sound effect matching the content is determined, the content is aligned with the matched background sound effect, that is, the emotional variation of the content is kept basically consistent with the emotional variation of the background sound effect.
For example, referring to FIG. 30, in one application scenario the reply text is "天气不错，国足又赢球了，好开心" ("The weather is nice, the national football team has won again, I'm so happy"), and the whole reply text needs a superimposed background sound effect. The reply text is split into three pieces of sub-content, "天气不错，", "国足又赢球了，" and "好开心"; the emotional polarity category of each part is happy, the emotional intensities are 0.48, 0.60 and 0.55 respectively (represented by the black dots in the lower half of the figure), and the total pronunciation durations of the parts are 0.3 s, 0.5 s and 0.2 s. Through step 703 above, a music file whose emotional polarity category is happy has been preliminarily determined; further, the emotional variation track of this music file can be calculated and aggregated to obtain the emotional intensity of each part of the music. The waveform in FIG. 30 represents a piece of music that can be divided into 15 small fragments of 0.1 s each; based on parameters such as the sound intensity and rhythm of each fragment, the emotional intensity of each fragment is obtained by fixed rules or a classifier. The emotional intensities of the 15 fragments are 0.41, 0.65, 0.53, 0.51, 0.34, 0.40, 0.63, 0.43, 0.52, 0.33, 0.45, 0.53, 0.44, 0.42 and 0.41 (represented by the black dots in the upper half of the figure). It can be seen that the sub-segment formed by fragments 4, 5 and 6 has a total duration of 0.3 s and a maximum emotional intensity of 0.51 (from fragment 4); the sub-segment formed by fragments 7, 8, 9, 10 and 11 has a total duration of 0.5 s and a maximum emotional intensity of 0.63 (from fragment 7); and the sub-segment formed by fragments 12 and 13 has a total duration of 0.2 s and a maximum emotional intensity of 0.53 (from fragment 12). That is, the emotional variation of these three sub-segments is basically consistent with the emotional variation trend of the three pieces of sub-content of the reply text (the tracks of the two broken lines in the figure are basically consistent), so the music segment formed by these three sub-segments is the background sound effect matching the reply text. Therefore "天气不错，", "国足又赢球了，" and "好开心" of the reply text can be aligned with these three sub-segments respectively, so as to produce the effect of "speech superimposed with a background sound effect" in the subsequent speech synthesis process.
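By way of illustration, the alignment of the text sub-contents with the music fragments may be sketched as follows; the exhaustive search over starting fragments and the absolute-difference distance are illustrative assumptions about how the "basically consistent" trajectories are found.

```python
def align_text_to_music(sub_contents, frag_intensities, frag_len=0.1):
    """Find the music window whose emotional trajectory best matches the text.

    sub_contents     : list of (duration_seconds, text_intensity) tuples,
                       e.g. [(0.3, 0.48), (0.5, 0.60), (0.2, 0.55)]
    frag_intensities : per-fragment emotional intensities of the music clip
    frag_len         : fragment length in seconds (0.1 s in the example)

    For every possible starting fragment, the clip is cut into consecutive
    sub-segments matching the text durations; the maximum intensity of each
    sub-segment forms a trajectory, and the offset minimizing the distance
    to the text trajectory is returned with the sub-segment ranges.
    """
    counts = [round(d / frag_len) for d, _ in sub_contents]
    total = sum(counts)
    best = None
    for start in range(len(frag_intensities) - total + 1):
        pos, traj, ranges = start, [], []
        for n in counts:
            traj.append(max(frag_intensities[pos:pos + n]))
            ranges.append((pos, pos + n))
            pos += n
        dist = sum(abs(m - t) for m, (_, t) in zip(traj, sub_contents))
        if best is None or dist < best[0]:
            best = (dist, start, ranges, traj)
    return best
```

With the durations and fragment intensities of FIG. 30, this sketch selects the window beginning at the 4th fragment, consistent with the alignment described above.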
The system framework, terminal device and related speech synthesis methods of the embodiments of the present invention have been described in detail above. Based on the same inventive concept, hardware devices of the embodiments of the present invention are described below.
Referring to FIG. 31, FIG. 31 is a schematic structural diagram of a speech synthesis device 200 according to an embodiment of the present invention. As shown in FIG. 31, the device 200 may include one or more processors 2011, one or more memories 2012 and an audio circuit 2013. In a specific implementation, the device 200 may further include components such as an input unit 2016 and a display unit 2019, and the processor 2011 may be connected to the memory 2012, the audio circuit 2013, the input unit 2016, the display unit 2019 and other components through a bus. These components are described separately as follows:
处理器2011是设备200的控制中心,利用各种接口和线路连接设备200的各个部件,在可能实施例中,处理器2011还可包括一个或多个处理核心。处理器2011可通过运行或执行存储在存储器2012内的软件程序(指令)和/或模块,以及调用存储在存储器2012内的数据来执行语音合成(比如执行图4或图9实施例中的各种模块的功能以及处理数据),以便于实现设备200与用户之间的实时语音对话。The processor 2011 is a control center of the device 200, and uses various interfaces and lines to connect various components of the device 200. In a possible embodiment, the processor 2011 may further include one or more processing cores. The processor 2011 may perform speech synthesis by running or executing software programs (instructions) and / or modules stored in the memory 2012, and calling data stored in the memory 2012 (such as executing each of the embodiments in FIG. 4 or FIG. 9). Functions of this module and processing data) to facilitate real-time voice conversation between the device 200 and the user.
The memory 2012 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. Accordingly, the memory 2012 may further include a memory controller to provide the processor 2011 and the input unit 2016 with access to the memory 2012. The memory 2012 may be specifically configured to store software programs (instructions) and data (related data in the acoustic model library and related data in the TTS parameter library).
The audio circuit 2013 may provide an audio interface between the device 200 and the user, and the audio circuit 2013 may further be connected to a loudspeaker 2014 and a microphone 2015. On the one hand, the microphone 2015 collects the user's sound signal and converts it into an electrical signal, which is received by the audio circuit 2013 and converted into audio data (that is, the user's input speech is formed); the audio data is then transmitted to the processor 2011 for speech processing. On the other hand, after the processor 2011 synthesizes the reply speech based on the user's input speech, the reply speech is transmitted to the audio circuit 2013, which converts the received audio data (the reply speech) into an electrical signal and transmits it to the loudspeaker 2014, where it is converted into a sound signal and output. In this way the reply speech is presented to the user, achieving a real-time voice conversation between the device 200 and the user.
The input unit 2016 may be configured to receive numeric or character information entered by the user and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, the input unit 2016 may include a touch-sensitive surface 2017 and other input devices 2018. The touch-sensitive surface 2017, also referred to as a touch display screen or a touchpad, can collect the user's touch operations on or near it and drive the corresponding connection apparatus according to a preset program. Specifically, the other input devices 2018 may include, but are not limited to, one or more of a physical keyboard, function keys, a trackball, a mouse, a joystick, and the like.
The display unit 2019 may be configured to display information entered by the user or information provided by the device 200 to the user (such as an identifier or text related to the reply speech) and various graphical user interfaces of the device 200, which may be composed of graphics, text, icons, video, and any combination thereof. Specifically, the display unit 2019 may include a display panel 2020; optionally, the display panel 2020 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Although in FIG. 31 the touch-sensitive surface 2017 and the display panel 2020 are two separate components, in some embodiments the touch-sensitive surface 2017 and the display panel 2020 may be integrated to implement the input and output functions. For example, the touch-sensitive surface 2017 may cover the display panel 2020; when the touch-sensitive surface 2017 detects a touch operation on or near it, the operation is passed to the processor 2011 to determine the type of the touch event, and the processor 2011 then provides a corresponding visual output on the display panel 2020 according to the type of the touch event.
Those skilled in the art can understand that the device 200 in the embodiments of the present invention may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components. For example, the device 200 may further include a communication module, a camera, and the like, which are not described here again.
Specifically, the processor 2011 may implement the speech synthesis method of the embodiments of the present invention by running or executing the software programs (instructions) stored in the memory 2012 and calling the data stored in the memory 2012, including: the processor 2011 determines the identity of the user according to the user's current input speech; obtains an acoustic model from the acoustic model library according to the current input speech, where the preset information of the acoustic model includes two or more of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm; determines basic speech synthesis information from the speech synthesis parameter library according to the identity of the user, where the basic speech synthesis information includes a variation of one or more of the preset speech rate, the preset volume, and the preset pitch; determines a reply text according to the current input speech; determines enhanced speech synthesis information from the speech synthesis parameter library according to the reply text and the context information, where the enhanced speech synthesis information includes a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and performs speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
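A minimal, self-contained sketch of how these steps might be chained is shown below. All function names, parameter dictionaries, and return values are hypothetical placeholders standing in for the voiceprint recognizer, dialogue engine, acoustic model library, and TTS parameter library; they are not the actual interfaces of device 200.

# Illustrative sketch of the synthesis flow run by processor 2011.
# Every name below is a hypothetical placeholder, not the device's real API.

def identify_user(input_speech: str) -> str:
    """Stand-in for voiceprint recognition: map the utterance to a user id."""
    return "user_001"

def select_acoustic_model(input_speech: str) -> dict:
    """Stand-in for choosing a model from the acoustic model library; its preset
    information carries rate, volume, pitch, timbre, intonation, and prosody."""
    return {"rate": 1.0, "volume": 1.0, "pitch": 1.0,
            "timbre": "neutral", "intonation": "flat", "prosody": "default"}

def basic_synthesis_info(user_id: str) -> dict:
    """Per-user variations of rate / volume / pitch from the parameter library."""
    return {"rate": +0.1, "volume": -0.05, "pitch": 0.0}

def generate_reply(input_speech: str) -> str:
    """Stand-in for the dialogue engine that produces the reply text."""
    return "The weather is nice today."

def enhanced_synthesis_info(reply_text: str, context: str) -> dict:
    """Variations of timbre / intonation / prosody chosen from the reply and context."""
    return {"intonation": "cheerful", "prosody": "light"}

def synthesize(reply_text: str, model: dict, basic: dict, enhanced: dict) -> dict:
    """Apply the variations to the model presets; a real device would emit audio here."""
    params = dict(model)
    for key, delta in basic.items():
        params[key] = params[key] + delta   # numeric presets are shifted
    params.update(enhanced)                 # categorical presets are replaced
    return {"text": reply_text, "params": params}

input_speech = "What's the weather like?"
user_id = identify_user(input_speech)
model = select_acoustic_model(input_speech)
basic = basic_synthesis_info(user_id)
reply = generate_reply(input_speech)
enhanced = enhanced_synthesis_info(reply, context=input_speech)
print(synthesize(reply, model, basic, enhanced))

The point of the sketch is the ordering: identity and model selection depend only on the input speech, the basic variations depend on who is speaking, and the enhanced variations depend on what is being replied.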
For the specific implementation process of the speech synthesis method executed by the processor 2011, reference may be made to the foregoing method embodiments, and details are not described here again.
It should be noted that, in a possible implementation, when the modules in the embodiment of FIG. 4 or FIG. 9 are software modules, the memory 2012 may further be configured to store these software modules, and the processor 2011 may run the software programs (instructions) and/or these software modules in the memory 2012 and call the data stored in the memory 2012 to perform speech synthesis.
It should also be noted that FIG. 31 is merely one implementation of the speech synthesis device of the present invention; in a possible embodiment, the processor 2011 and the memory 2012 in the device 200 may also be deployed in an integrated manner.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions; when the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one network site, computer, server, or data center to another network site, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid-state drive), or the like.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

Claims (18)

  1. A speech synthesis method, wherein the method comprises:
    determining the identity of a user according to the user's current input speech;
    obtaining an acoustic model from a preset acoustic model library according to the current input speech, wherein preset information of the acoustic model comprises a plurality of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm;
    determining basic speech synthesis information according to the identity of the user, wherein the basic speech synthesis information comprises a variation of one or more of the preset speech rate, the preset volume, and the preset pitch;
    determining a reply text according to the current input speech;
    determining enhanced speech synthesis information according to the reply text and context information of the current input speech, wherein the enhanced speech synthesis information comprises a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm; and
    performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information.
  2. The method according to claim 1, wherein the determining enhanced speech synthesis information according to the reply text and context information comprises:
    determining a literary style feature of the reply text according to the reply text, wherein the literary style feature comprises one or more of the number of sentences in part or all of the reply text, the number of characters in each sentence, and the ordering of the sentence lengths; and
    selecting a corresponding variation of the preset prosodic rhythm according to the literary style feature of the reply text, wherein there is a correspondence between the literary style feature and the variation of the preset prosodic rhythm, and the variation of the preset prosodic rhythm indicates respective changes of the reading duration, reading pause positions, reading pause times, and stress of characters in part or all of the reply text.
  3. The method according to claim 1 or 2, wherein the preset information of the selected acoustic model further comprises a language style feature, and the language style feature specifically comprises one or more of a catchphrase, a response manner for a specific scenario, a wisdom type, a personality type, interspersed popular expressions or dialect, and an appellation for a specific person.
  4. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library, and the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    determining the user's preferences according to the identity of the user; and
    selecting an acoustic model from the acoustic model library according to the user's preferences.
  5. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library, and each acoustic model has an acoustic model identifier; the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    determining, according to the content of the current input speech, an acoustic model identifier related to the content of the current input speech; and
    selecting, from the acoustic model library, the acoustic model corresponding to the acoustic model identifier.
  6. The method according to any one of claims 1 to 3, wherein there are a plurality of acoustic models in the acoustic model library;
    the obtaining an acoustic model from a preset acoustic model library according to the current input speech comprises:
    selecting a plurality of acoustic models from the acoustic model library according to the identity of the user;
    determining a weight value of each of the plurality of acoustic models, wherein the weight value of each acoustic model is preset by the user, or the weight value of each acoustic model is determined in advance according to the user's preferences; and
    fusing the acoustic models based on the weight values to obtain a fused acoustic model.
  7. The method according to any one of claims 1 to 6, wherein before the determining the identity of the user according to the user's current input speech, the method further comprises:
    determining a correspondence between a target character and the user's preferred pronunciation according to the user's historical input speech, and associating the correspondence between the target character and the user's preferred pronunciation with the identity of the user;
    correspondingly, the performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information comprises:
    when the target character associated with the identity of the user exists in the reply text, performing speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises: selecting a background sound effect from a preset music library according to the reply text, wherein the background sound effect is music or a sound special effect;
    correspondingly, the performing speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information comprises:
    performing speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  9. The method according to claim 8, wherein the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; and the emotional intensity identifier is used to indicate a respective degree value of the at least one emotion;
    the selecting a background sound effect from a preset music library according to the reply text comprises:
    splitting the content of the reply text into a plurality of sub-contents, and separately determining the emotional polarity type and the emotional intensity of each sub-content; and
    selecting the best-matching background sound effect from the preset music library according to the emotional polarity type and the emotional intensity of each sub-content;
    wherein the best-matching background sound effect comprises a plurality of sub-segments, each sub-segment has an emotional polarity type identifier and an emotional intensity identifier, the emotional polarity type indicated by the emotional polarity type identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the emotional intensity identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
  10. A speech synthesis device, wherein the speech synthesis device comprises:
    a speech recognition module, configured to receive a user's current input speech;
    a speech dialogue module, configured to determine the identity of the user according to the user's current input speech, determine basic speech synthesis information according to the identity of the user, determine a reply text according to the current input speech, and determine enhanced speech synthesis information according to the reply text and context information of the current input speech; and
    a speech synthesis module, configured to obtain an acoustic model from a preset acoustic model library according to the current input speech, wherein preset information of the acoustic model comprises a plurality of a preset speech rate, a preset volume, a preset pitch, a preset timbre, a preset intonation, and a preset prosodic rhythm, and to perform speech synthesis on the reply text through the acoustic model according to the basic speech synthesis information and the enhanced speech synthesis information;
    wherein the basic speech synthesis information comprises a variation of one or more of the preset speech rate, the preset volume, and the preset pitch in the preset information of the acoustic model; and the enhanced speech synthesis information comprises a variation of one or more of the preset timbre, the preset intonation, and the preset prosodic rhythm in the preset information of the acoustic model.
  11. The device according to claim 10, wherein the speech dialogue module is specifically configured to:
    determine a literary style feature of the reply text according to the reply text, wherein the literary style feature comprises one or more of the number of sentences in part or all of the reply text, the number of characters in each sentence, and the ordering of the sentence lengths; and
    select a corresponding variation of the preset prosodic rhythm according to the literary style feature of the reply text, wherein there is a correspondence between the literary style feature and the variation of the preset prosodic rhythm, and the variation of the preset prosodic rhythm indicates respective changes of the reading duration, reading pause positions, reading pause times, and stress of characters in part or all of the reply text.
  12. The device according to claim 10 or 11, wherein the preset information of the selected acoustic model further comprises a language style feature, and the language style feature specifically comprises one or more of a catchphrase, a response manner for a specific scenario, a wisdom type, a personality type, interspersed popular expressions or dialect, and an appellation for a specific person.
  13. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and the speech synthesis module is specifically configured to:
    determine the user's preferences according to the identity of the user, and select an acoustic model from the acoustic model library according to the user's preferences.
  14. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and each acoustic model has an acoustic model identifier; the speech synthesis module is specifically configured to:
    determine, according to the content of the current input speech, an acoustic model identifier related to the content of the current input speech, and select, from the acoustic model library, the acoustic model corresponding to the acoustic model identifier.
  15. The device according to any one of claims 10 to 12, wherein there are a plurality of acoustic models in the acoustic model library, and the speech synthesis module is specifically configured to:
    select a plurality of acoustic models from the acoustic model library according to the identity of the user; determine a weight value of each of the plurality of acoustic models, wherein the weight value of each acoustic model is preset by the user, or the weight value of each acoustic model is determined in advance according to the user's preferences; and fuse the acoustic models based on the weight values to obtain a fused acoustic model.
  16. The device according to any one of claims 10 to 15, wherein
    the speech dialogue module is further configured to: before the speech recognition module receives the user's current input speech, determine a correspondence between a target character and the user's preferred pronunciation according to the user's historical input speech, and associate the correspondence between the target character and the user's preferred pronunciation with the identity of the user; and
    the speech synthesis module is specifically configured to: when the target character associated with the identity of the user exists in the reply text, perform speech synthesis on the reply text through the acoustic model according to the correspondence between the target character and the user's preferred pronunciation, the basic speech synthesis information, and the enhanced speech synthesis information.
  17. The device according to any one of claims 10 to 16, wherein
    the speech dialogue module is further configured to select a background sound effect from a preset music library according to the reply text, wherein the background sound effect is music or a sound special effect; and
    the speech synthesis module is specifically configured to perform speech synthesis on the reply text through the acoustic model according to the background sound effect, the basic speech synthesis information, and the enhanced speech synthesis information.
  18. The device according to claim 17, wherein the background sound effect has one or more emotional polarity type identifiers and emotional intensity identifiers; the emotional polarity type identifier is used to indicate at least one of the following emotions: happiness, liking, sadness, surprise, anger, fear, and disgust; and the emotional intensity identifier is used to indicate a respective degree value of the at least one emotion;
    the speech dialogue module is specifically configured to: split the content of the reply text into a plurality of sub-contents, and separately determine the emotional polarity type and the emotional intensity of each sub-content; and select the best-matching background sound effect from the preset music library according to the emotional polarity type and the emotional intensity of each sub-content;
    wherein the best-matching background sound effect comprises a plurality of sub-segments, each sub-segment has an emotional polarity type identifier and an emotional intensity identifier, the emotional polarity type indicated by the emotional polarity type identifier of each sub-segment is the same as the emotional polarity type of the corresponding sub-content, and the change trend of the emotional intensities indicated by the emotional intensity identifiers of the sub-segments is consistent with the change trend of the emotional intensities of the sub-contents.
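As a rough illustration of the weighted model fusion described in claims 6 and 15, the sketch below blends the numeric preset values of two hypothetical acoustic models under user-preference weights. The dictionary representation of a model, the parameter names, and the simple weighted average are assumptions for demonstration only; a real implementation would fuse model parameters rather than a handful of preset values.

# Illustrative weighted fusion of several acoustic models' preset values.
# Model dictionaries and weights are hypothetical.

def fuse_acoustic_models(models, weights):
    """Blend the numeric preset values of several models by normalized weights."""
    total = sum(weights)
    norm = [w / total for w in weights]  # normalize user-set or preference-derived weights
    fused = {}
    for key in models[0]:
        fused[key] = round(sum(w * m[key] for w, m in zip(norm, models)), 3)
    return fused

# Two hypothetical acoustic models selected for one user, for example a "news anchor"
# voice and a "cartoon character" voice, with preference weights 0.7 and 0.3.
anchor  = {"rate": 1.00, "volume": 0.80, "pitch": 0.95}
cartoon = {"rate": 1.20, "volume": 0.90, "pitch": 1.30}

print(fuse_acoustic_models([anchor, cartoon], [0.7, 0.3]))
# expected output: {'rate': 1.06, 'volume': 0.83, 'pitch': 1.055}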
PCT/CN2019/076552 2018-07-28 2019-02-28 Speech synthesis method and related device WO2020024582A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810857240.1 2018-07-28
CN201810857240.1A CN108962217B (en) 2018-07-28 2018-07-28 Speech synthesis method and related equipment

Publications (1)

Publication Number Publication Date
WO2020024582A1 true WO2020024582A1 (en) 2020-02-06

Family

ID=64466758

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/076552 WO2020024582A1 (en) 2018-07-28 2019-02-28 Speech synthesis method and related device

Country Status (2)

Country Link
CN (1) CN108962217B (en)
WO (1) WO2020024582A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022043712A1 (en) * 2020-08-28 2022-03-03 Sonantic Limited A text-to-speech synthesis method and system, and a method of training a text-to-speech synthesis system
EP4116839A4 (en) * 2020-03-27 2023-03-22 Huawei Technologies Co., Ltd. Voice interaction method and electronic device
EP4102397A4 (en) * 2020-02-03 2023-06-28 Huawei Technologies Co., Ltd. Text information processing method and apparatus, computer device, and readable storage medium

Families Citing this family (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962217B (en) * 2018-07-28 2021-07-16 华为技术有限公司 Speech synthesis method and related equipment
CN109461448A (en) * 2018-12-11 2019-03-12 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN109829039B (en) * 2018-12-13 2023-06-09 平安科技(深圳)有限公司 Intelligent chat method, intelligent chat device, computer equipment and storage medium
CN109523986B (en) * 2018-12-20 2022-03-08 百度在线网络技术(北京)有限公司 Speech synthesis method, apparatus, device and storage medium
CN109524000A (en) * 2018-12-28 2019-03-26 苏州思必驰信息科技有限公司 Offline implementation method and device
CN111399629B (en) * 2018-12-29 2022-05-03 Tcl科技集团股份有限公司 Operation guiding method of terminal equipment, terminal equipment and storage medium
CN109903748A (en) * 2019-02-14 2019-06-18 平安科技(深圳)有限公司 A kind of phoneme synthesizing method and device based on customized sound bank
CN111627417B (en) * 2019-02-26 2023-08-08 北京地平线机器人技术研发有限公司 Voice playing method and device and electronic equipment
CN109977202A (en) * 2019-03-06 2019-07-05 北京西屋信维科技发展有限公司 A kind of intelligent customer service system and its control method
CN110136688B (en) * 2019-04-15 2023-09-29 平安科技(深圳)有限公司 Text-to-speech method based on speech synthesis and related equipment
CN110060656B (en) * 2019-05-05 2021-12-10 标贝(北京)科技有限公司 Model management and speech synthesis method, device and system and storage medium
CN110211564A (en) * 2019-05-29 2019-09-06 泰康保险集团股份有限公司 Phoneme synthesizing method and device, electronic equipment and computer-readable medium
CN110189742B (en) * 2019-05-30 2021-10-08 芋头科技(杭州)有限公司 Method and related device for determining emotion audio frequency, emotion display and text-to-speech
CN110134250B (en) * 2019-06-21 2022-05-31 易念科技(深圳)有限公司 Human-computer interaction signal processing method, device and computer readable storage medium
CN110197655B (en) * 2019-06-28 2020-12-04 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing speech
CN112331193A (en) * 2019-07-17 2021-02-05 华为技术有限公司 Voice interaction method and related device
CN112242132B (en) * 2019-07-18 2024-06-14 阿里巴巴集团控股有限公司 Data labeling method, device and system in voice synthesis
CN110265021A (en) * 2019-07-22 2019-09-20 深圳前海微众银行股份有限公司 Personalized speech exchange method, robot terminal, device and readable storage medium storing program for executing
CN112417201A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Audio information pushing method and system, electronic equipment and computer readable medium
CN110600001A (en) * 2019-09-09 2019-12-20 大唐网络有限公司 Voice generation method and device
CN110610720B (en) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110782918B (en) * 2019-10-12 2024-02-20 腾讯科技(深圳)有限公司 Speech prosody assessment method and device based on artificial intelligence
CN112765971B (en) * 2019-11-05 2023-11-17 北京火山引擎科技有限公司 Text-to-speech conversion method and device, electronic equipment and storage medium
CN110933330A (en) * 2019-12-09 2020-03-27 广州酷狗计算机科技有限公司 Video dubbing method and device, computer equipment and computer-readable storage medium
CN111031386B (en) * 2019-12-17 2021-07-30 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111081244B (en) * 2019-12-23 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method and device
CN111276122B (en) * 2020-01-14 2023-10-27 广州酷狗计算机科技有限公司 Audio generation method and device and storage medium
CN111292720B (en) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 Speech synthesis method, device, computer readable medium and electronic equipment
CN111241308B (en) * 2020-02-27 2024-04-26 曾兴 Self-help learning method and system for spoken language
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium
CN113539230A (en) * 2020-03-31 2021-10-22 北京奔影网络科技有限公司 Speech synthesis method and device
CN111862938A (en) * 2020-05-07 2020-10-30 北京嘀嘀无限科技发展有限公司 Intelligent response method, terminal and computer readable storage medium
CN113793590A (en) * 2020-05-26 2021-12-14 华为技术有限公司 Speech synthesis method and device
CN113763920B (en) * 2020-05-29 2023-09-08 广东美的制冷设备有限公司 Air conditioner, voice generating method thereof, voice generating device and readable storage medium
CN111696518A (en) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 Automatic speech synthesis method based on text
CN111768755A (en) * 2020-06-24 2020-10-13 华人运通(上海)云计算科技有限公司 Information processing method, information processing apparatus, vehicle, and computer storage medium
CN111916054B (en) * 2020-07-08 2024-04-26 标贝(青岛)科技有限公司 Lip-based voice generation method, device and system and storage medium
CN113763921B (en) * 2020-07-24 2024-06-18 北京沃东天骏信息技术有限公司 Method and device for correcting text
CN111805558B (en) * 2020-08-03 2021-10-08 深圳作为科技有限公司 Self-learning type elderly nursing robot system with memory recognition function
CN111973178A (en) * 2020-08-14 2020-11-24 中国科学院上海微系统与信息技术研究所 Electroencephalogram signal identification system and method
CN112037793A (en) * 2020-08-21 2020-12-04 北京如影智能科技有限公司 Voice reply method and device
CN112148846A (en) * 2020-08-25 2020-12-29 北京来也网络科技有限公司 Reply voice determination method, device, equipment and storage medium combining RPA and AI
CN111968619A (en) * 2020-08-26 2020-11-20 四川长虹电器股份有限公司 Method and device for controlling voice synthesis pronunciation
CN112116905B (en) * 2020-09-16 2023-04-07 珠海格力电器股份有限公司 Method and device for converting memo information into alarm clock to play
CN111930900B (en) * 2020-09-28 2021-09-21 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment
CN112349271A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice information processing method and device, electronic equipment and storage medium
CN112382287A (en) * 2020-11-11 2021-02-19 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112071300B (en) * 2020-11-12 2021-04-06 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium
TWI768589B (en) * 2020-12-10 2022-06-21 國立勤益科技大學 Deep learning rhythm practice system
CN112599113B (en) * 2020-12-30 2024-01-30 北京大米科技有限公司 Dialect voice synthesis method, device, electronic equipment and readable storage medium
CN113053373A (en) * 2021-02-26 2021-06-29 上海声通信息科技股份有限公司 Intelligent vehicle-mounted voice interaction system supporting voice cloning
CN113066473A (en) * 2021-03-31 2021-07-02 建信金融科技有限责任公司 Voice synthesis method and device, storage medium and electronic equipment
CN113112987B (en) * 2021-04-14 2024-05-03 北京地平线信息技术有限公司 Speech synthesis method, training method and device of speech synthesis model
CN114999438B (en) * 2021-05-08 2023-08-15 中移互联网有限公司 Audio playing method and device
CN112989103A (en) * 2021-05-20 2021-06-18 广州朗国电子科技有限公司 Message playing method, device and storage medium
CN112992118B (en) * 2021-05-22 2021-07-23 成都启英泰伦科技有限公司 Speech model training and synthesizing method with few linguistic data
CN113096638B (en) * 2021-06-09 2021-09-07 北京世纪好未来教育科技有限公司 Speech synthesis model training method, speech synthesis method and device
CN113838451B (en) * 2021-08-17 2022-09-23 北京百度网讯科技有限公司 Voice processing and model training method, device, equipment and storage medium
CN113851106B (en) * 2021-08-17 2023-01-06 北京百度网讯科技有限公司 Audio playing method and device, electronic equipment and readable storage medium
CN113724687B (en) * 2021-08-30 2024-04-16 深圳市神经科学研究院 Speech generation method, device, terminal and storage medium based on brain electrical signals
CN114189587A (en) * 2021-11-10 2022-03-15 阿里巴巴(中国)有限公司 Call method, device, storage medium and computer program product
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium
CN114678006B (en) * 2022-05-30 2022-08-23 广东电网有限责任公司佛山供电局 Rhythm-based voice synthesis method and system
CN114822495B (en) * 2022-06-29 2022-10-14 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method
CN117059082B (en) * 2023-10-13 2023-12-29 北京水滴科技集团有限公司 Outbound call conversation method, device, medium and computer equipment based on large model
CN117153162B (en) * 2023-11-01 2024-05-24 北京中电慧声科技有限公司 Voice privacy protection method and device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402982A (en) * 2010-09-14 2012-04-04 盛乐信息技术(上海)有限公司 Loud reading system with selectable background sounds and realization method of system
JP5112978B2 (en) * 2008-07-30 2013-01-09 Kddi株式会社 Speech recognition apparatus, speech recognition system, and program
CN104485100A (en) * 2014-12-18 2015-04-01 天津讯飞信息科技有限公司 Text-to-speech pronunciation person self-adaptive method and system
CN104766603A (en) * 2014-01-06 2015-07-08 安徽科大讯飞信息科技股份有限公司 Method and device for building personalized singing style spectrum synthesis model
WO2015178600A1 (en) * 2014-05-23 2015-11-26 Samsung Electronics Co., Ltd. Speech recognition method and apparatus using device information
CN105895103A (en) * 2015-12-03 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN106663219A (en) * 2014-04-17 2017-05-10 软银机器人欧洲公司 Methods and systems of handling a dialog with a robot
CN106683667A (en) * 2017-01-13 2017-05-17 深圳爱拼信息科技有限公司 Automatic rhythm extracting method, system and application thereof in natural language processing
CN107644643A (en) * 2017-09-27 2018-01-30 安徽硕威智能科技有限公司 A kind of voice interactive system and method
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000764B (en) * 2006-12-18 2011-05-18 黑龙江大学 Speech synthetic text processing method based on rhythm structure
EP2595143B1 (en) * 2011-11-17 2019-04-24 Svox AG Text to speech synthesis for texts with foreign language inclusions
US9911407B2 (en) * 2014-01-14 2018-03-06 Interactive Intelligence Group, Inc. System and method for synthesis of speech from provided text
CN105304080B (en) * 2015-09-22 2019-09-03 科大讯飞股份有限公司 Speech synthetic device and method
CN106550156A (en) * 2017-01-23 2017-03-29 苏州咖啦魔哆信息技术有限公司 A kind of artificial intelligence's customer service system and its implementation based on speech recognition
CN106952648A (en) * 2017-02-17 2017-07-14 北京光年无限科技有限公司 A kind of output intent and robot for robot
CN107731219B (en) * 2017-09-06 2021-07-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device and equipment
CN107767869B (en) * 2017-09-26 2021-03-12 百度在线网络技术(北京)有限公司 Method and apparatus for providing voice service
CN107705783B (en) * 2017-11-27 2022-04-26 北京搜狗科技发展有限公司 Voice synthesis method and device
CN107993650A (en) * 2017-11-30 2018-05-04 百度在线网络技术(北京)有限公司 Method and apparatus for generating information


Also Published As

Publication number Publication date
CN108962217B (en) 2021-07-16
CN108962217A (en) 2018-12-07

Similar Documents

Publication Publication Date Title
WO2020024582A1 (en) Speech synthesis method and related device
US6721706B1 (en) Environment-responsive user interface/entertainment device that simulates personal interaction
US6795808B1 (en) User interface/entertainment device that simulates personal interaction and charges external database with relevant data
US6731307B1 (en) User interface/entertainment device that simulates personal interaction and responds to user's mental state and/or personality
US6728679B1 (en) Self-updating user interface/entertainment device that simulates personal interaction
US20200395008A1 (en) Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
US20190193273A1 (en) Robots for interactive comedy and companionship
CN108806656B (en) Automatic generation of songs
CN108806655B (en) Automatic generation of songs
CN113010138B (en) Article voice playing method, device and equipment and computer readable storage medium
CN106486121A (en) It is applied to the voice-optimizing method and device of intelligent robot
WO2022242706A1 (en) Multimodal based reactive response generation
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
JP6222465B2 (en) Animation generating apparatus, animation generating method and program
KR20200085433A (en) Voice synthesis system with detachable speaker and method using the same
Narayanan et al. Multimodal systems for children: building a prototype.
Fröhlich Auditory human-computer interaction: An integrated approach
CN117809677A (en) Server, display equipment and digital human interaction method
CN117809680A (en) Server, display equipment and digital human interaction method
CN117809617A (en) Server, display equipment and voice interaction method
CN117809678A (en) Server, display equipment and digital human interaction method
CN117812279A (en) Server, terminal, display equipment and digital human interaction method
CN117111738A (en) Man-machine interaction method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19844016

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19844016

Country of ref document: EP

Kind code of ref document: A1