WO2020098269A1 - Speech synthesis method and speech synthesis apparatus - Google Patents

Speech synthesis method and speech synthesis apparatus

Info

Publication number
WO2020098269A1
Authority
WO
WIPO (PCT)
Prior art keywords
emotional
target
acoustic
emotion
speech
Prior art date
Application number
PCT/CN2019/091844
Other languages
English (en)
French (fr)
Inventor
邓利群
胡月志
杨占磊
孙文华
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2020098269A1
Priority to US16/944,863 (US11282498B2)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • This application relates to the field of artificial intelligence, in particular to a speech synthesis method and speech synthesis device.
  • At present, speech synthesis technology is widely used in many fields, such as intelligent mobile terminals, smart homes, and vehicle-mounted devices.
  • The quality requirements for synthesized speech are becoming ever higher: it is no longer enough that the speech "can be heard clearly"; it is also expected to be "highly realistic and emotionally expressive". However, existing speech synthesis typically tags the text with an emotion and synthesizes speech of the corresponding emotion type based only on that tag, without taking emotion intensity into account, so the synthesized speech can express an emotion type such as "happy" or "sad" but its emotion intensity cannot be controlled.
  • Embodiments of the present application provide a speech synthesis method and a speech synthesis apparatus, which are used to synthesize emotional speech of a given emotion type at different emotion intensities, improving the fidelity of the emotional speech and enriching its emotional expression.
  • In a first aspect, an embodiment of the present application provides a speech synthesis method, including: first, obtaining a target emotion type and a target emotion intensity parameter of an input text, where the target emotion intensity parameter is used to characterize the emotion intensity corresponding to the target emotion type; second, obtaining a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter; then, inputting text features of the input text into the target emotional acoustic model to obtain acoustic features of the input text; and finally, synthesizing target emotional speech according to the acoustic features of the input text.
  • It is easy to understand that the target emotional acoustic model is an acoustic model obtained by model training for the target emotion type and the emotion intensity corresponding to the target emotion type; the acoustic features corresponding to the input text are obtained based on the target emotional acoustic model, and the corresponding target emotional speech is finally synthesized from these acoustic features.
  • It can be seen from the above that the technical solution of the present application has the following advantage: the text features of the input text are converted into acoustic features by the target emotional acoustic model corresponding to the target emotion type and the emotion intensity of that emotion type, and the target emotional speech is finally synthesized from these acoustic features. Because the target emotional acoustic model is obtained based on the target emotion type and the emotion intensity of that emotion type, it can be used to obtain acoustic features that reflect the emotion intensity, so that speech of different emotion types and different emotion intensities can be synthesized, which enhances the diversity of the synthesized speech in emotional expression.
  • Optionally, obtaining the target emotion type and the target emotion intensity parameter of the input text includes: determining the target emotion type of the input text according to an emotion label of the input text, where the emotion label is used to characterize the emotion type of the input text; and determining the target emotion intensity parameter according to an emotion intensity requirement corresponding to the input text; optionally, the emotion intensity requirement is specified by the user.
  • Optionally, obtaining the target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter includes: selecting, from an emotional acoustic model set, the emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, where the emotional acoustic model set includes multiple emotional acoustic models, and the multiple emotional acoustic models include the target emotional acoustic model.
  • Optionally, before the target emotional acoustic model is obtained, the method further includes: for different emotion types and the different emotion intensity parameters corresponding to each emotion type, performing model training using a neutral acoustic model and the corresponding emotional speech training data to obtain the emotional acoustic model set. The emotional speech training data is data carrying the emotion corresponding to one or more emotion types; the neutral acoustic model is obtained by model training using neutral speech training data, and the neutral speech training data is data that does not carry the emotion of any emotion type.
  • Optionally, the acoustic feature training error corresponding to the emotional speech training data is related to the above target emotion intensity parameter, where the acoustic feature training error corresponding to the emotional speech training data is used to characterize the acoustic feature loss between the acoustic features predicted from the emotional speech training data during acoustic model training and the original acoustic features of the emotional speech training data. Optionally, this relation can be embodied by computing the training error with the error formula loss = 0.5 × (y2 − β·y1 − (1 − β)·y)², where β is the target emotion intensity parameter, y1 is the acoustic feature parameter predicted for the emotional speech training data under the neutral acoustic model, y2 is the acoustic feature parameter predicted for the emotional speech training data under the target emotional acoustic model, and y is the original acoustic feature parameter of the emotional speech training data.
  • Optionally, both the neutral acoustic model and the emotional acoustic models can be constructed based on a hidden Markov model or a deep neural network model.
  • Optionally, the text features include features corresponding to at least one of the phonemes, syllables, words, or prosodic phrases of the text; the acoustic features include at least one of a fundamental frequency feature, a line spectrum pair feature, a voiced/unvoiced flag feature, or a spectral envelope feature of the speech.
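  • As a concrete illustration of the two feature families named above, the following Python sketch groups them into simple containers; the field names and array shapes are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TextFeatures:
    """Linguistic features extracted from the input text (names are illustrative)."""
    phonemes: List[str]          # phoneme sequence
    syllables: List[str]         # syllable sequence
    words: List[str]             # word segmentation result
    prosodic_phrases: List[str]  # prosodic phrase boundaries

@dataclass
class AcousticFeatures:
    """Frame-level acoustic features predicted by the acoustic model."""
    f0: np.ndarray        # fundamental frequency per frame
    lsp: np.ndarray       # line spectrum pair coefficients per frame
    vuv: np.ndarray       # voiced/unvoiced flag per frame
    envelope: np.ndarray  # spectral envelope per frame
```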
  • In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, which has the function of implementing the method of the first aspect or any possible implementation of the first aspect. This function can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above function.
  • In a third aspect, an embodiment of the present application provides a speech synthesis apparatus, including a processor and a memory; the memory is used to store computer-executable instructions, and when the speech synthesis apparatus is running, the processor executes the computer-executable instructions stored in the memory, so that the function network element executes the speech synthesis method according to the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer can execute the speech synthesis method of the first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, an embodiment of the present application provides a computer program product containing computer operation instructions; when it is run on a computer, the computer can perform the speech synthesis method of the first aspect or any possible implementation of the first aspect.
  • In a sixth aspect, an embodiment of the present application provides a chip system. The chip system includes a processor for supporting a speech synthesis apparatus in implementing the functions involved in the first aspect or any possible implementation of the first aspect. In a possible design, the chip system further includes a memory, which is used to store the program instructions and data necessary for the control function network element. The chip system may consist of a chip, or may include a chip and other discrete devices.
  • In a seventh aspect, an embodiment of the present application provides an emotion model training method, which can be used for training the emotional acoustic models in the first aspect described above. The specific training method includes: constructing an emotional acoustic model based on a hidden Markov model or a deep neural network model, using the final model parameters of the neutral acoustic model as the initial model parameters of the emotional acoustic model, inputting the text features corresponding to the emotion type "happy" with an emotion intensity of 0.5 into the initialized "happy" emotional acoustic model, and calculating the training error of the acoustic features corresponding to these text features based on the emotion intensity parameter 0.5. When the training error is greater than a preset error, the calculation is iterated until the training error is less than or equal to the preset error; at that point, the corresponding model parameters are used as the final model parameters of the "happy" emotional acoustic model, which completes the training of this "happy" emotional acoustic model.
  • Similarly, the "happy" emotional acoustic models corresponding to the other emotion intensity parameters of the "happy" emotion type (such as 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, and 1) can be trained with the above emotion model training method. Further, emotional acoustic models of other emotion types, such as a "sad" emotional acoustic model, a "surprised" emotional acoustic model, and a "fear" emotional acoustic model, can also be obtained in the same way.
  • Finally, the emotional acoustic models for the various emotion types and the emotion intensity parameters of each emotion type together constitute the emotional acoustic model set.
  • Optionally, the training of the neutral acoustic model is similar to the training of the "happy" emotional acoustic model described above. Specifically, a neutral acoustic model is constructed based on a hidden Markov model or a deep neural network model, and the model parameters corresponding to each neural network layer in the model are initialized with random values; after the model parameters of the neutral acoustic model are initialized, neutral speech training data that does not carry any emotion type is fed into the neutral acoustic model for training, and the model parameters obtained when the training error is less than the preset error are determined as the final model parameters of the neutral acoustic model, which completes the training of the neutral acoustic model.
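  • The procedure of the seventh aspect can be summarized by the following minimal Python sketch of an adaptive training loop. It is only an outline under stated assumptions: `predict`, `compute_error`, and `update` stand in for the forward pass, the intensity-weighted error calculation, and the parameter update of whatever model family (HMM or deep neural network) is actually used; none of these helper names come from the patent.

```python
import copy

def train_emotion_model(neutral_params, train_batches, predict, compute_error,
                        update, preset_error, max_iters=100):
    """Adaptive training sketch for one (emotion type, intensity) pair: start from
    the final neutral-model parameters, then iterate until the intensity-weighted
    training error is no larger than the preset error."""
    params = copy.deepcopy(neutral_params)       # initialise from the neutral model
    for _ in range(max_iters):                   # outer training iterations
        worst_error = 0.0
        for batch in train_batches:              # one pass per data batch
            predicted = predict(params, batch)           # forward pass
            error = compute_error(predicted, batch)      # intensity-weighted loss
            params = update(params, batch, error)        # parameter update
            worst_error = max(worst_error, error)
        if worst_error <= preset_error:          # stop once the error is small enough
            return params
    return params                                # final parameters of the emotional model
```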
  • In an eighth aspect, an embodiment of the present application further provides an emotion model training apparatus, which has the function of implementing the method of the seventh aspect or any possible implementation of the seventh aspect. This function can be implemented by hardware, or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above function.
  • FIG. 1 is a schematic diagram of an embodiment of a speech synthesis method provided in an embodiment of this application;
  • FIG. 2 is a schematic diagram of a deep neural network model provided in an embodiment of this application.
  • FIG. 3 is a schematic diagram of an adaptive training process of an emotional acoustic model in an embodiment of the present application
  • FIG. 4 is a schematic diagram of a hardware structure of a speech synthesis apparatus provided in an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a speech synthesis apparatus provided in an embodiment of the present application.
  • the embodiments of the present application provide a speech synthesis method and a speech synthesis device, which are suitable for synthesizing speech with different emotional intensities, and enhance the diversity of synthesized speech in emotional expression. The details are described below.
  • The naming or numbering of steps in this application does not mean that the steps in a method flow must be executed in the temporal or logical order indicated by that naming or numbering; the order of execution of named or numbered process steps can be changed according to the technical objective to be achieved, as long as the same or a similar technical effect is obtained.
  • The division into modules in this application is a logical division. In actual implementation there may be other ways of dividing them; for example, multiple modules may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between modules may be electrical or take other similar forms, none of which is limited in this application.
  • Modules or sub-modules described as separate components may or may not be physically separate, and may or may not be physical modules, or they may be distributed among multiple circuit modules; some or all of them may be selected according to actual needs to achieve the purpose of the solution of this application.
  • The speech synthesis method proposed in the embodiments of the present application is applicable to the fields of intelligent mobile terminals, smart homes, vehicle-mounted devices, and the like. Specifically, the speech synthesis method in the embodiments of the present application can be applied to entities with an emotional speech synthesis function, such as smart phone terminals, smart speakers, wearable smart devices, smart vehicle-mounted devices, and smart robots; these entities have emotion analysis and processing capabilities as well as speech synthesis capabilities.
  • FIG. 1 is a schematic diagram of an embodiment of a speech synthesis method in an embodiment of the present application.
  • the speech synthesis method in the embodiment of the present application includes:
  • 101. The speech synthesis apparatus obtains a target emotion type and a target emotion intensity parameter corresponding to the input text.
  • As is well known, an emotion type is an emotion classification determined according to factors such as the type of value subject, the dominant value variable, and the value goal orientation; emotion intensity refers to a person's selective tendency towards things and is the most important dynamic characteristic of emotion. Specifically, emotion types may include "happy", "sad", "surprised", "fear", and so on, and the same emotion type can be expressed with different emotion intensities.
  • Optionally, the speech synthesis apparatus may determine the target emotion type and the target emotion intensity parameter of the speech to be synthesized according to the emotion label of the input text and the user's emotion intensity requirement, where the target emotion intensity parameter is the emotion intensity parameter corresponding to the target emotion type, and an emotion intensity parameter identifies the magnitude of the emotion intensity. For example, the emotion intensity can be divided into 10 intensity levels, represented from low to high by the emotion intensity parameters 0.1, 0.2, 0.3, ..., 0.9, and 1.
  • For example, for the input text "明天就是周末了，太开心了" ("Tomorrow is already the weekend, I'm so happy"), speech is synthesized with the emotion type "happy". The emotion type, i.e. the "happy" label, can be obtained in the following ways: 1) the emotion type specified by the user is determined as the target emotion type; for example, the user specifies the emotion type information using a markup language, which can be input through a corresponding text-to-speech (TTS) software program or hardware device; 2) if it is not specified by the user, the target emotion type can also be obtained by analyzing the input text, for example, by using an emotion type recognition model to obtain the emotion type corresponding to the input text.
  • The emotion intensity parameter can be obtained in the following ways: 3) the emotion intensity parameter value specified by the user is determined as the target emotion intensity parameter; for example, the user specifies the emotion intensity parameter corresponding to the input text when entering it; 4) the target emotion intensity parameter is determined from an approximate emotion intensity given by the user, such as very mild, mild, moderate, strong, or very strong; specifically, the emotion intensity parameters corresponding to very mild, mild, moderate, strong, and very strong can be preset to 0.1, 0.2, 0.5, 0.7, and 0.9, respectively; 5) a default value is used as the target emotion intensity parameter; for example, when nothing is specified, the default emotion intensity parameter value 0.5 is used as the target emotion intensity parameter corresponding to "happy".
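  • A minimal Python sketch of how ways 3) to 5) could be combined is given below; the function and mapping names are hypothetical, and the level-to-parameter mapping simply follows the example values in the preceding paragraph.

```python
# Coarse levels mapped to intensity parameters as in the example above.
LEVEL_TO_PARAM = {"very mild": 0.1, "mild": 0.2, "moderate": 0.5,
                  "strong": 0.7, "very strong": 0.9}

def resolve_intensity(user_value=None, user_level=None, default=0.5):
    """Return the target emotion intensity parameter from user input (ways 3-5)."""
    if user_value is not None:          # way 3: an explicit value such as 0.6
        return float(user_value)
    if user_level is not None:          # way 4: a coarse level such as "moderate"
        return LEVEL_TO_PARAM[user_level]
    return default                      # way 5: fall back to the default value 0.5

print(resolve_intensity(user_level="strong"))  # -> 0.7
print(resolve_intensity())                     # -> 0.5
```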
  • 102. The speech synthesis apparatus obtains a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter.
  • Emotional acoustic models are acoustic models corresponding to different emotion types and to the different emotion intensities of each emotion type. There are at least two emotional acoustic models; the exact number depends on how many emotion types there are and on how many intensity levels each emotion type has, and it is easy to understand that one emotion intensity parameter can correspond to one emotion intensity level. For example, suppose the speech synthesis apparatus supports the four emotion types "happy", "sad", "surprised", and "fear", and, as in step 101 above, each of these four emotion types is divided into 10 emotion intensity levels; in this case there are a total of 40 emotional acoustic models in the speech synthesis apparatus.
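  • The following Python sketch shows one straightforward way to organize such a set: a dictionary keyed by (emotion type, intensity parameter), giving 4 x 10 = 40 entries for the example above. The loader is a hypothetical placeholder, since the patent does not specify how the trained models are stored.

```python
EMOTION_TYPES = ["happy", "sad", "surprised", "fear"]
INTENSITY_LEVELS = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1 ... 1.0

def load_acoustic_model(emotion, intensity):
    """Hypothetical loader standing in for however trained models are stored."""
    return f"model[{emotion}@{intensity}]"     # placeholder object

# One acoustic model per (emotion type, intensity) pair: 4 x 10 = 40 entries.
MODEL_SET = {(e, i): load_acoustic_model(e, i)
             for e in EMOTION_TYPES for i in INTENSITY_LEVELS}

def select_model(target_emotion, target_intensity):
    """Step 102: pick the target emotional acoustic model from the set."""
    return MODEL_SET[(target_emotion, round(target_intensity, 1))]

print(len(MODEL_SET))                      # 40
print(select_model("happy", 0.6))          # model[happy@0.6]
```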
  • Optionally, the emotional acoustic models and the neutral acoustic model can both be constructed based on a hidden Markov model or a deep neural network model; of course, the acoustic model can also be built on other mathematical models with similar functions, and this application does not impose any restriction on this.
  • FIG. 2 is a schematic diagram of a deep neural network model provided in an embodiment of the present application.
  • As shown in FIG. 2, the acoustic model in the embodiment of the present application may be modeled using a deep neural network such as a bidirectional long short-term memory (BLSTM) network, where BLSTM is a bidirectional time-recurrent neural network and a recurrent neural network model commonly used in the field of machine learning.
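  • As a minimal sketch of such a BLSTM acoustic model, the following PyTorch code maps a sequence of text features to a sequence of acoustic features; the layer sizes and feature dimensions are illustrative assumptions and are not specified in the patent.

```python
import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    """Minimal BLSTM acoustic model sketch: text features in, acoustic features out."""
    def __init__(self, text_dim=384, hidden_dim=256, acoustic_dim=187, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(text_dim, hidden_dim, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, acoustic_dim)  # 2x for both directions

    def forward(self, text_feats):            # (batch, frames, text_dim)
        hidden, _ = self.blstm(text_feats)    # (batch, frames, 2*hidden_dim)
        return self.proj(hidden)              # (batch, frames, acoustic_dim)

model = BLSTMAcousticModel()
dummy = torch.randn(4, 100, 384)              # a batch of 4 utterances, 100 frames each
print(model(dummy).shape)                     # torch.Size([4, 100, 187])
```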
  • Optionally, the speech synthesis apparatus obtains the target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter specifically by selecting, from an emotional acoustic model set, the emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, where the emotional acoustic model set includes at least one emotional acoustic model, and the at least one emotional acoustic model includes the target emotional acoustic model.
  • On the basis of selecting the emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter from the emotional acoustic model set as the target emotional acoustic model, further optionally, the speech synthesis apparatus also performs, for different emotion types and emotion intensity parameters, model training using a neutral acoustic model and the corresponding emotional speech data to obtain the emotional acoustic model set. The emotional speech training data is speech data carrying the emotion corresponding to one or more emotion types; the neutral acoustic model is obtained by model training using neutral speech training data, and the neutral speech training data is speech data that does not have the emotional color of any emotion type. Specifically, the emotional acoustic model can be obtained with the assistance of the neutral acoustic model, by using the emotional speech data to adaptively train the emotional acoustic model.
  • FIG. 3 is a schematic diagram of an adaptive training process of an emotional acoustic model in an embodiment of the present application. As shown in FIG. 3, the training of an emotional acoustic model with the target emotion type "happy" and the target emotion intensity parameter 0.6 is taken as an example, where a dotted arrow indicates a one-time operation that is executed only once, and a solid arrow indicates an operation that is iterated in a loop; that is, S31 and S32 only need to be executed once, while S33 and S34 are executed repeatedly.
  • The emotional acoustic training process of FIG. 3 specifically includes the following steps:
  • S31. Train the neutral acoustic model.
  • S32. Initialize the "happy" emotional acoustic model.
  • S33. Calculate the acoustic feature training error based on the emotion intensity parameter.
  • S34. Update the emotional acoustic model according to the acoustic feature training error.
  • The above S33 and S34 are executed cyclically; the number of executions is determined by the number of iterations of the whole emotional acoustic model training and by the batch size used in each calculation. That is, if the total number of training samples is N and the batch size of each execution is 32, each iteration requires (N / 32) executions; and if the whole training process ends after T iterations, then S33 and S34 need to be executed (T * N / 32) times for the entire training.
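  • As a quick numerical check of this count, with illustrative values that are not taken from the patent:

```python
N, batch_size, T = 6400, 32, 50        # example values only
steps_per_iteration = N // batch_size  # N / 32 = 200 executions of S33/S34 per iteration
total_steps = T * steps_per_iteration  # T * N / 32 = 10000 executions in total
print(steps_per_iteration, total_steps)
```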
  • the above model training process separately carries out model training for each emotion type and each corresponding emotion intensity to obtain an emotion acoustic model corresponding to different emotion types and different emotion intensity parameters of the emotion type.
  • In FIG. 3 above, the neutral acoustic model can be trained using the BLSTM model described in FIG. 2. Specifically, the input of the neutral acoustic model is text features, which may specifically be the neutral speech training data feature set; the model parameter values of the BLSTM model are initialized with random values, the neutral speech training data is fed into the BLSTM model in batches (for example, 32 samples per batch) for training, and the model parameters of each neural network layer in the BLSTM model are adjusted so that the acoustic feature training error keeps decreasing. The neutral speech training data is trained over multiple iterations, the training is terminated when the number of iterations reaches a preset number or the acoustic feature training error reaches a predetermined value, and the resulting model parameters are output as the target neutral model.
  • The training of the emotional acoustic model is the same as the training of the neutral model described above; the difference is that, before training, the neutral acoustic model is initialized with random values, whereas the emotional acoustic model is initialized with the model parameters of the above target neutral acoustic model. The initialization of the model parameters of the emotional acoustic model can be: assigning the model parameters of each neural network layer in the target neutral acoustic model, in turn, to the corresponding parameters of each neural network layer in the emotional acoustic model.
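  • Continuing the PyTorch sketch above, this layer-by-layer initialization can be expressed by copying the neutral model's trained parameters into an emotional model of identical architecture; this is only an illustration of the idea, not the patent's prescribed implementation.

```python
import copy
import torch.nn as nn

def init_from_neutral(neutral_model: nn.Module, emotional_model: nn.Module) -> nn.Module:
    """Copy the trained neutral model's parameters, layer by layer, into an
    emotional model of identical architecture (the S32 initialization step)."""
    emotional_model.load_state_dict(copy.deepcopy(neutral_model.state_dict()))
    return emotional_model

# e.g. init_from_neutral(trained_neutral_blstm, BLSTMAcousticModel())
```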
  • It should be noted that FIG. 3 above is only a schematic diagram of an acoustic model training method and does not represent its concrete implementation. Optionally, in a concrete implementation, the input of the neutral acoustic model may be either the neutral speech training data feature set or the neutral speech training data itself; when the neutral speech training data is input, step S31 also needs to perform text feature extraction on the neutral speech training data to obtain the neutral speech training data feature set before the neutral model training continues. Similarly, the input of the emotional acoustic model training may be either the emotional speech training data or the feature set of the emotional speech training data.
  • Optionally, the acoustic feature training error corresponding to the emotional speech training data is related to the emotion intensity parameter, where the acoustic feature training error corresponding to the emotional speech training data is used to characterize the acoustic feature loss between the acoustic features predicted from the emotional speech training data during acoustic model training and the original acoustic features of the emotional speech training data, as described in step S33 above.
  • Further optionally, the acoustic feature training error corresponding to the emotional speech training data can be calculated with the following error formula:
  • loss = 0.5 × (y2 − β·y1 − (1 − β)·y)², where loss is the acoustic feature training error corresponding to the emotional speech training data, β is the emotion intensity parameter, y1 is the acoustic feature parameter predicted for the emotional speech training data under the neutral acoustic model, y2 is the acoustic feature parameter predicted for the emotional speech training data under the target emotional acoustic model, and y is the original acoustic feature parameter of the emotional speech training data.
  • It should be noted that the above error formula is only one feasible way of calculating the acoustic feature training error and has no direct correspondence to the model training method shown in FIG. 3. Specifically, the calculation of the acoustic feature training error in the model training method of FIG. 3 may use the above error formula or other error formulas; when the above error formula is used, the original acoustic feature parameter y of the emotional speech training data needs to be obtained in advance, and the specific way of obtaining y is not limited in the embodiments of the present application. In the embodiments of the present application, the acoustic feature training error can be calculated using, but not limited to, the error formula described above, and other similar formulas can also be used; this application does not impose any restriction on this.
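  • A minimal PyTorch rendering of this error term is shown below. Averaging over frames and feature dimensions is an assumption made here for batch training; the patent only states the per-sample form of the formula.

```python
import torch

def intensity_weighted_loss(y2, y1, y, beta):
    """loss = 0.5 * (y2 - beta*y1 - (1 - beta)*y)^2, averaged over all elements.
    y2: features predicted by the emotional model being trained,
    y1: features predicted by the (fixed) neutral model,
    y:  original acoustic features of the emotional training data,
    beta: emotion intensity parameter in [0, 1]."""
    target = beta * y1 + (1.0 - beta) * y    # interpolation between neutral and emotional targets
    return 0.5 * torch.mean((y2 - target) ** 2)

# Example call with random tensors of an illustrative shape:
y2, y1, y = (torch.randn(4, 100, 187) for _ in range(3))
print(intensity_weighted_loss(y2, y1, y, beta=0.6))
```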
  • 103. The speech synthesis apparatus uses the target emotional acoustic model to convert the text features corresponding to the input text into the acoustic features corresponding to the input text.
  • The text features may include, but are not limited to, features corresponding to at least one of the phonemes, syllables, words, or prosodic phrases of the text; the acoustic features may include, but are not limited to, at least one of a fundamental frequency feature, a line spectrum pair feature, a voiced/unvoiced flag feature, or a spectral envelope feature of the speech.
  • Optionally, the method further includes: the speech synthesis apparatus performs text analysis on the input text to determine the text features corresponding to the input text. Specifically, the text analysis may include, but is not limited to, at least one of a text normalization operation, a word segmentation operation, a part-of-speech tagging operation, a grammatical analysis operation, a prosody prediction operation, a grapheme-to-phoneme conversion operation, or a duration information analysis operation; a simplified sketch of such a pipeline is given after the definitions below.
  • the text normalization operation refers to converting non-Chinese characters in the text, such as Arabic numerals, English symbols, and various symbols, into corresponding Chinese characters.
  • the word segmentation operation refers to dividing the continuous Chinese character string in the text into a sequence consisting of words.
  • Part-of-speech tagging refers to tagging nouns, verbs and adjectives in the text.
  • the grammatical analysis operation refers to analyzing the grammar and semantic structure of each sentence in the text, determining the semantic center, the stress position and intonation of the sentence, thereby providing important information for the prosody prediction operation.
  • the prosody prediction operation refers to predicting prosody structures at different levels in each sentence corresponding to the text, such as prosody words, prosody phrases, intonation phrases, etc.
  • the phonetic conversion operation refers to the conversion of Chinese characters into Pinyin.
  • The duration information analysis operation refers to predicting the duration information of syllables, initials and finals, phonemes, states, and the like in the speech.
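  • The sketch below illustrates only the shape of such a text analysis front end in Python; the normalization table and the regular-expression "segmentation" are toy placeholders, and a real implementation would rely on dedicated lexicons and models for segmentation, POS tagging, prosody prediction, and grapheme-to-phoneme conversion.

```python
import re

CN_DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def text_normalize(text: str) -> str:
    """Toy normalization: map Arabic digits to Chinese characters; a real system
    would also handle English letters and other non-Chinese symbols."""
    return "".join(CN_DIGITS.get(ch, ch) for ch in text)

def analyze_text(text: str) -> dict:
    """Skeleton of the text analysis step; the "segmentation" here is only a
    placeholder regular expression, not a real word segmenter."""
    normalized = text_normalize(text)
    tokens = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+", normalized)
    return {"normalized": normalized, "tokens": tokens}

print(analyze_text("明天就是周末了，太开心了"))
```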
  • 104. The speech synthesis apparatus generates the target emotional speech according to the acoustic features corresponding to the input text.
  • A vocoder is used to synthesize the acoustic features corresponding to the input text into the corresponding target emotional speech, where the emotion type of the target emotional speech is the above target emotion type and the emotion intensity value of the target emotional speech is equal to the value of the target emotion intensity parameter. The above vocoder may include, but is not limited to, a STRAIGHT vocoder or a WORLD vocoder, and may also be another type of vocoder.
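  • As an illustration of this last step, the following Python sketch feeds predicted fundamental frequency, spectral envelope, and aperiodicity into the WORLD vocoder through the pyworld package; the use of pyworld and soundfile, the sample rate, and the assumption that the acoustic model outputs these three streams are choices made for this example rather than requirements of the patent.

```python
import numpy as np
import pyworld        # Python wrapper of the WORLD vocoder (assumed to be installed)
import soundfile as sf

def synthesize_waveform(f0, envelope, aperiodicity, sample_rate=16000, path="out.wav"):
    """Turn predicted acoustic features into a waveform with the WORLD vocoder.
    f0: (frames,); envelope and aperiodicity: (frames, bins); all float64."""
    wav = pyworld.synthesize(np.ascontiguousarray(f0, dtype=np.float64),
                             np.ascontiguousarray(envelope, dtype=np.float64),
                             np.ascontiguousarray(aperiodicity, dtype=np.float64),
                             sample_rate)
    sf.write(path, wav, sample_rate)   # save the synthesized target emotional speech
    return wav
```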
  • In the embodiments of the present application, the emotional acoustic models corresponding to the different emotion intensity parameters of the different emotion types are obtained through training on large amounts of data; the emotion type and emotion intensity parameter corresponding to the text are extracted, the corresponding target emotional acoustic model is selected from the emotional acoustic models according to the emotion type and the emotion intensity parameter, the target emotional acoustic model is used to convert the text features into the corresponding acoustic features, and the emotional speech data is finally synthesized. Because the emotional acoustic models are trained on data for the different emotion intensity parameters of the different emotion types, the acoustic features obtained with such an emotional acoustic model are more accurate, speech of different emotion types and different emotion intensities can be synthesized, and the diversity of the synthesized speech in emotional expression is improved.
  • Further, the speech synthesis method in the embodiments of the present application uses an adaptive learning technique based on the emotion intensity parameter, so that emotional acoustic models of different emotion intensities can be trained with only neutral speech training data and a small amount of emotional speech training data. Because the construction cost of emotional speech training data is higher than that of neutral speech training data, the speech synthesis method in the embodiments of the present application can also reduce the amount of emotional speech training data required, thereby reducing the cost of constructing training data.
  • the above speech synthesis device includes a hardware structure and / or a software module corresponding to each function.
  • the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is executed by hardware or computer software driven hardware depends on the specific application and design constraints of the technical solution. Professional technicians can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
  • the speech synthesis apparatus described in FIG. 1 may be implemented by one physical device, or may be implemented by multiple physical devices together, or may be a logical function module in a physical device. This is not specifically limited.
  • the speech synthesis device described in FIG. 1 may be realized by the speech synthesis device in FIG. 4.
  • FIG. 4 is a schematic diagram of a hardware structure of the speech synthesis apparatus provided in an embodiment of the present application.
  • the speech synthesis device 400 includes at least one processor 401, a communication line 402, a memory 403 and at least one communication interface 404.
  • The processor 401 can be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the programs of the solution of this application.
  • the communication line 402 may include a path for transferring information between the aforementioned components.
  • The communication interface 404 uses any transceiver-like device for communicating with other devices or communication networks, such as an Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
  • The memory 403 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory may exist independently and be connected to the processor through the communication line 402. The memory can also be integrated with the processor.
  • the memory 403 is used to store computer execution instructions for executing the solution of the present application, and the processor 401 controls execution.
  • the processor 401 is used to execute computer-executed instructions stored in the memory 403, thereby implementing the speech synthesis method provided by the following embodiments of the present application.
  • the computer execution instructions in the embodiments of the present application may also be called application program codes, which are not specifically limited in the embodiments of the present application.
  • the processor 401 may include one or more CPUs, such as CPU0 and CPU1 in FIG. 4.
  • the speech synthesis apparatus 400 may include multiple processors, such as the processor 401 and the processor 408 in FIG. 4. Each of these processors can be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor.
  • the processor here may refer to one or more devices, circuits, and / or processing cores for processing data (eg, computer program instructions).
  • the speech synthesis apparatus 400 may further include an output device 405 and an input device 406.
  • the output device 405 communicates with the processor 401 and can display information in various ways.
  • For example, the output device 405 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, a projector, or the like.
  • the input device 406 communicates with the processor 401 and can receive user input in a variety of ways.
  • the input device 406 may be a mouse, a keyboard, a touch screen device, or a sensing device.
  • the above speech synthesis apparatus 400 may be a general-purpose device or a dedicated device.
  • In a specific implementation, the speech synthesis device 400 may be a desktop computer, a portable computer, a web server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a structure similar to that in FIG. 4.
  • the embodiment of the present application does not limit the type of the speech synthesis device 400.
  • the embodiments of the present application may divide the function modules of the speech synthesis module according to the above method example, for example, each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the above integrated modules may be implemented in the form of hardware or software function modules. It should be noted that the division of the modules in the embodiments of the present application is schematic, and is only a division of logical functions. In actual implementation, there may be another division manner.
  • FIG. 5 is a schematic structural diagram of an embodiment of a speech synthesis apparatus provided in an embodiment of the present application.
  • the speech synthesis apparatus 50 in the embodiment of the present application includes: a processing module 501;
  • the processing module 501 is used to perform the following operations:
  • Obtaining a target emotion type and a target emotion intensity parameter of the input text, where the target emotion intensity parameter is used to characterize the emotion intensity corresponding to the target emotion type; obtaining a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter; inputting the text features of the input text into the target emotional acoustic model to obtain the acoustic features of the input text; and synthesizing the target emotional speech according to the acoustic features of the input text.
  • the processing module 501 is specifically configured to: determine the target emotion type according to the emotion label of the input text; and determine the target emotion intensity parameter according to the emotion intensity requirement corresponding to the input text.
  • Optionally, in one example, the processing module 501 is specifically configured to: select, from an emotional acoustic model set, the emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, where the emotional acoustic model set includes multiple emotional acoustic models, and the multiple emotional acoustic models include the target emotional acoustic model.
  • Optionally, in one example, the processing module 501 is further configured to: for different emotion types and the different emotion intensity parameters of each emotion type, perform model training using a neutral acoustic model and the corresponding emotional speech training data to obtain an emotional acoustic model set, where the emotional acoustic model set includes the emotional acoustic model corresponding to each emotion intensity parameter of each emotion type. The emotional speech training data is data carrying the emotion corresponding to one or more emotion types; the neutral acoustic model is obtained by model training using neutral speech training data, and the neutral speech training data is data that does not carry the emotion of any emotion type.
  • the acoustic feature training error corresponding to the emotional speech training data is related to the emotion intensity parameter, and the acoustic feature training error corresponding to the emotional speech training data is used to characterize the use of emotional speech training data in the acoustic model training process The acoustic feature loss between the predicted acoustic feature and the original acoustic feature of the emotional speech training data.
  • the acoustic feature training error corresponding to the emotional speech training data is calculated by an error calculation formula, and the error calculation formula is:
  • loss = 0.5 × (y2 − β·y1 − (1 − β)·y)², where loss is the acoustic feature training error corresponding to the emotional speech training data, β is the emotion intensity parameter, y1 is the acoustic feature parameter predicted for the emotional speech training data under the neutral acoustic model, y2 is the acoustic feature parameter predicted for the emotional speech training data under the target emotional acoustic model, and y is the original acoustic feature parameter of the emotional speech training data.
  • both the neutral acoustic model and the emotional acoustic model may be constructed based on a hidden Markov model or a deep neural network model.
  • Optionally, in one example, the text features include features corresponding to at least one of the phonemes, syllables, words, or prosodic phrases of the text; the acoustic features include at least one of a fundamental frequency feature, a line spectrum pair feature, a voiced/unvoiced flag feature, or a spectral envelope feature of the speech.
  • Optionally, in one example, the speech synthesis device 50 may further include an input module 502 and an output module 503, where the input module 502 may be used to input the above input text into the speech synthesis device 50, and the output module 503 may be used to output the finally synthesized target emotional speech.
  • the speech synthesis device 50 is presented in the form of dividing each functional module in an integrated manner.
  • A "module" here can refer to an application-specific integrated circuit (ASIC), a circuit, a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the above functions.
  • In a simple embodiment, those skilled in the art will appreciate that the speech synthesis device 50 may adopt the form shown in FIG. 4.
  • For example, the processor 401 in FIG. 4 may call the computer-executable instructions stored in the memory 403, so that the speech synthesis apparatus 50 executes the speech synthesis method in the above method embodiment.
  • Specifically, the functions/implementation processes of the processing module 501, the input module 502, and the output module 503 in FIG. 5 can be implemented by the processor 401 in FIG. 4 calling the computer-executable instructions stored in the memory 403; alternatively, the function/implementation process of the processing module 501 in FIG. 5 can be implemented by the processor 401 in FIG. 4 calling the computer-executable instructions stored in the memory 403, while the functions/implementation processes of the input module 502 and the output module 503 in FIG. 5 are implemented through the communication interface 404 in FIG. 4.
  • the speech synthesis apparatus provided in the embodiments of the present application can be used to execute the above speech synthesis method, the technical effects that can be obtained can refer to the above method embodiments, and details are not described herein again.
  • the speech synthesis device is presented in the form of dividing each functional module in an integrated manner.
  • the embodiments of the present application may also divide the execution function network element and the control function network element of each function module corresponding to each function, which is not specifically limited in the embodiment of the present application.
  • an embodiment of the present application provides a chip system.
  • the chip system includes a processor for supporting a user plane functional entity to implement the foregoing speech synthesis method.
  • the chip system also includes a memory.
  • the memory is used to store program instructions and data necessary to execute the functional network element or control the functional network element.
  • the chip system may be composed of a chip, or may include a chip and other discrete devices, which is not specifically limited in the embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be from a website site, computer, server or data center Transmit to another website, computer, server or data center via wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.).
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or a data center that integrates one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, Solid State Disk (SSD)) or the like.
  • the program may be stored in a computer-readable storage medium, and the storage medium may include: ROM, RAM, magnetic disk or optical disk, etc.

Abstract

A speech synthesis method and a speech synthesis apparatus. The method includes: obtaining a target emotion type and a target emotion intensity parameter corresponding to an input text (101); determining a corresponding target emotional acoustic model according to the target emotion type and the target emotion intensity parameter (102); inputting text features of the input text into the target emotional acoustic model to obtain acoustic features of the input text (103); and synthesizing target emotional speech according to the acoustic features of the input text (104). The method can synthesize speech with different emotion intensities and enhances the diversity of synthesized speech in emotional expression.

Description

Speech synthesis method and speech synthesis apparatus
This application claims priority to Chinese Patent Application No. 201811360232.2, entitled "一种语音合成方法及语音合成装置" (Speech synthesis method and speech synthesis apparatus), filed with the China Patent Office on November 15, 2018, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a speech synthesis method and a speech synthesis apparatus.
Background
At present, speech synthesis technology is widely used in many fields, such as intelligent mobile terminals, smart homes, and vehicle-mounted devices. The quality requirements for synthesized speech are becoming higher and higher: it is no longer enough that the speech "can be heard clearly"; it is increasingly expected to be "highly realistic and emotionally expressive".
"Highly realistic and emotionally expressive" speech synthesis is a great challenge for current speech synthesis technology. In existing speech synthesis technology, the text is tagged with an emotion, and speech of the corresponding emotion type is synthesized based solely on that emotion tag, without considering the emotion intensity factor. As a result, the synthesized speech is not rich enough in emotional expression: it can only express emotion types such as "happy" or "sad", and the emotion intensity of the speech cannot be controlled, so the speech is rather monotonous in emotional expression, with low fidelity and poor emotional richness.
发明内容
本申请实施例提供了一种语音合成方法及语音合成装置,用于合成不同情感强度的情感类型对应的情感语音,提高情感语音的逼真度,提升情感语音在情感表达的丰富性。
为了达到上述技术目的,本申请实施例提供了以下技术方案:
第一方面,本申请实施例提供了一种语音合成方法,包括:首先,获取输入文本的目标情感类型以及目标情感强度参数,其中,该目标情感强度参数用于表征目标情感类型对应的情感强度;其次,获取目标情感类型以及目标情感强度参数对应的目标情感声学模型,再次,将输入文本的文本特征输入目标情感声学模型中得到该输入文本的声学特征;最终,根据该输入文本的声学特征合成目标情感语音。其中,容易理解,该目标情感声学模型是针对目标情感类型以及该目标情感类型对应的情感强度进行模型训练得到的声学模型;并基于该目标情感声学模型得到输入文本对应的声学特征,最终,将上述声学特征合成对应的目标情感语音。
从上述技术方案中可以看出,本申请技术方案具有以下优点:通过目标情感类型以及该目标情感类型的情感强度对应的目标情感声学模型,将输入文本的文本特征转换成声学特征,并最终将该声学特征合成目标情感语音。由于目标情感声学模型是基于目标情感类型以及该目标情感类型的情感强度得到的,因此,使用该目标情感声学模型可以获得与情感类型的情感强度相关的声学特征,以合成不同情感类型且不同情感强度的语音,提升合成语音在情感表现方面的多样性。
可选的,结合上述第一方面,在本申请实施例第一方面的第一种可能的实现方式中,获取输入文本的目标情感类型和目标情感强度参数,包括:根据输入文本的情感标签确定输入文本的目标情感类型,其中,情感标签用于表征输入文本的情感类型;根据该输入文 本对应的情感强度要求确定目标情感强度参数,可选的,情感强度要求由用户指定的。
可选的,结合上述第一方面或第一方面的第一种可能的实现方式,在本申请实施例第一方面的第二种可能的实现方式中,上述获取与目标情感类型和目标情感强度参数对应的目标情感声学模型包括:从情感声学模型集合中选取与目标情感类型和目标情感强度参数对应的情感声学模型作为目标情感声学模型,情感声学模型集合包括多个情感声学模型,多个情感声学模型包括目标情感声学模型。
可选的,结合上述第一方面的第二种可能的实现方式,在本申请实施例第一方面的第三种可能的实现方式中,在获取与目标情感类型和目标情感强度参数对应的目标情感声学模型之前,所述方法还包括:针对不同的情感类型和情感类型对应的不同情感强度参数,利用中性声学模型和对应的情感语音训练数据进行模型训练得到情感声学模型集合,该情感语音训练数据为具备一种或多种情感类型对应的情感的数据,该中性声学模型是使用中性语音训练数据进行模型训练得到的,该中性语音训练数据为不具备任何一种情感类型对应的情感的数据。
可选的,结合上述第一方面的第三种可能的实现方式,在本申请实施例第一方面的第四种可能的实现方式中,上述情感语音训练数据对应的声学特征训练误差与上述目标情感强度参数相关,其中,情感语音训练数据对应的声学特征训练误差用于表征在声学模型训练过程中使用情感语音训练数据预测得到的声学特征与该情感语音训练数据的原始声学特征之间的声学特征损失。
可选的,结合上述第一方面的第四种可能的实现方式,在本申请实施例第一方面的第五种可能的实现方式中,上述情感语音训练数据对应的声学特征训练误差与上述目标情感强度参数相关可以体现在:上述声学特征训练误差是由误差计算公式进行计算得到的,误差计算公式可以为:loss=0.5×(y2-β*y1-(1-β)*y) 2,其中,loss为声学特征训练误差,β为目标情感强度参数,y1为情感语音训练数据在所述中性声学模型下预测得到的声学特征参数,y2为情感语音训练数据在所述目标情感声学模型下预测得到的声学特征参数,y为情感语音训练数据的原始声学特征参数。
可选的,结合上述第一方面的第三种可能的实现方式至第五种可能的实现方式,在本申请实施例第一方面的第六种可能的实现方式中,中性声学模型和情感声学模型均可以基于隐马尔科夫模型或者深度神经网络模型构造得到的。
可选的,结合上述第一方面、第一方面的第一种可能的实现方式至第一方面的第五种可能的实现方式中的任意一种实现方式,在本申请实施例第一方面的第六种可能的实现方式中,上述文本特征包括:文本对应的音素、音节、词语或韵律短语中至少一项对应的特征;上述声学特征包括:声音对应的基频特征、线谱对特征、清浊音标志特征或频谱包络特征中的至少一项。
第二方面,本申请实施例提供了一种语音合成装置,该语音合成装置具有实现上述第一方面或第一方面任意一种可能实现方式的方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
第三方面,本申请实施例提供了一种语音合成装置,包括:处理器和存储器;该存储器用于存储计算机执行指令,当该语音合成装置运行时,该处理器执行该存储器存储的该计算机执行指令,以使该执行功能网元执行如上述第一方面或第一方面任意一种可能实现 方式的语音合成方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当其在计算机上运行时,使得计算机可以执行上述第一方面或第一方面任意一种可能实现方式的语音合成方法。
第五方面,本申请实施例提供了一种包含计算机操作指令的计算机程序产品,当其在计算机上运行时,使得计算机可以执行上述第一方面或第一方面任意一种可能实现方式的语音合成方法。
第六方面,本申请实施例提供了一种芯片系统,该芯片系统包括处理器,用于支持语音合成装置实现上述第一方面或第一方面任意一种可能的实现方式中所涉及的功能。在一种可能的设计中,芯片系统还包括存储器,存储器,用于保存控制功能网元必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
第七方面,本申请实施例提供了一种情感模型训练方法,该情感模型训练方法可以用于上述第一方面中情感声学模型的训练,其具体训练方法包括:基于隐马尔科夫模型或者深度神经网络模型得到情感声学模型,并将中性声学模型的最终模型参数作为该情感声学模型的初始化模型参数,将情感类型为“高兴”并且“高兴”的情感强度为0.5对应的文本特征输入上述初始化的“高兴”情感声学模型中,基于情感强度参数0.5计算该文本特征对应的声学特征的训练误差,当训练误差大于预设误差时,进行迭代计算直到训练误差小于或者等于预设误差,此时,将训练误差小于或者等于预设误差时对应的模型参数作为“高兴”情感声学模型的最终模型参数,以完成对上述“高兴”情感声学模型的训练。
同理,基于上述情感模型训练方法可以训练得到“高兴”情感类型的其他情感强度参数(如0.1、0.2、0.3、0.4、0.6、0.7、0.8、0.9和1)对应的“高兴”情感声学模型,进一步的,采用上述情感模型训练方法也可以得到其他情感类型的情感声学模型,例如“悲伤”情感声学模型、“惊讶”情感声学模型”和“恐惧”情感声学模型,从而,最终将各种情感类型以及情感类型的情感强度参数的情感声学模型组成情感声学模型集合。
可选的,对于中性声学模型的训练与上述“高兴”情感声学模型的训练方式类似,具体可以是:基于隐马尔科夫模型或者深度神经网络模型构建得到中性声学模型,将中性声学模型中各神经网络层对应的模型参数使用随机值初始化,进而,在对中性声学模型的模型参数进行初始化之后使用没有携带任何情感类型的中性语音训练数据带入该中性声学模型中进行训练,并将训练误差小于预设误差时对应的模型参数确定为该中性声学模型的最终模型参数,以完成对中性声学模型的训练。
第八方面,本申请实施例还提供了一种情感模型训练装置,该情感模型训练装置具有实现上述第七方面或第七方面任意一种可能实现方式的方法的功能。该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。该硬件或软件包括一个或多个与上述功能相对应的模块。
其中,上述第二方面至第八方面中任一种实现方式所带来的技术效果可参见第一方面中不同实现方式所带来的技术效果,此处不再赘述。
附图说明
图1为本申请实施例中提供的语音合成方法的一个实施例示意图;
图2为本申请实施例中提供的一种深度神经网络模型示意图;
图3为本申请实施例中情感声学模型的一种自适应训练流程示意图;
图4为本申请实施例中提供的语音合成装置的一个硬件结构示意图;
图5为本申请实施例中提供的语音合成装置的一个实施例结构示意图。
具体实施方式
下面结合附图,对本申请的实施例进行描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请实施例提供了一种语音合成方法及语音合成装置,适用于合成不同情感强度的语音,提升合成语音在情感表现方面的多样性。以下分别进行详细说明。
本申请中出现的术语“和/或”,可以是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本申请中字符“/”,一般表示前后关联对象是一种“或”的关系。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。在本申请中出现的对步骤进行的命名或者编号,并不意味着必须按照命名或者编号所指示的时间/逻辑先后顺序执行方法流程中的步骤,已经命名或者编号的流程步骤可以根据要实现的技术目的变更执行次序,只要能达到相同或者相类似的技术效果即可。本申请中所出现的模块的划分,是一种逻辑上的划分,实际应用中实现时可以有另外的划分方式,例如多个模块可以结合成或集成在另一个系统中,或一些特征可以忽略,或不执行,另外,所显示的或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,模块之间的间接耦合或通信连接可以是电性或其他类似的形式,本申请中均不作限定。并且,作为分离部件说明的模块或子模块可以是也可以不是物理上的分离,可以是也可以不是物理模块,或者可以分布到多个电路模块中,可以根据实际的需要选择其中的部分或全部模块来实现本申请方案的目的。
本申请实施例所提出的语音合成方法可以适用于智能移动终端领域、智能家居领域和车载设备领域等。具体来说,本申请实施例中的语音合成方法可以应用于具有情感语音合成功能的实体中,例如,智能手机终端、智能音箱、穿戴式智能设备、智能车载设备和智能机器人等,上述这些实体具有情感分析处理能力,以及语音合成能力。
为了便于理解本申请实施例中所述的语音合成方法,下面将结合具体的实施例对该语音合成方法进行详细说明。
图1为本申请实施例中语音合成方法的一个实施例示意图。
如图1所示,本申请实施例中的语音合成方法,包括:
101、语音合成装置获取输入文本对应的目标情感类型和目标情感强度参数。
众所周知,情感类型是一种依据价值主体类型、价值主导变量和价值目标导向等因素确定的一种情感分类;情感强度是指人对事物所产生的选择性倾向,是情感最重要的动力特性。具体来说,情感类型可以包括:“高兴”、“悲伤”、“惊讶”、“恐惧”等,同 一种情感类型在表达时会存在不同的情感强度。
可选的,语音合成装置可以根据输入文本的情感标签,以及用户的情感强度要求确定合成语音的目标情感类型和目标情感强度参数,其中,该目标情感强度参数是目标情感类型对应的情感轻度参数,情感强度参数用于标识情感强度大小,例如,情感强度可以划分为10个强度等级,从强度等级有低到高依次用情感强度参数为0.1、0.2、0.3、……0.9和1表示。
例如,对于输入文本“明天就是周末了,太开心了”,以“高兴”这一情感类型来进行语音合成为例。对于情感类型,即“高兴”标签可以通过以下方式获得:1)、将用户指定的情感类型确定为目标情感类型,例如,用户使用标记语言指定情感类型信息,具体可以通过对应的语音合成(text to speech,TTS)软件程式或者硬件设备输入;2)、若没有用户指定,还可以通过对输入文本进行情感分析得到目标情感类型,例如,使用情感类型识别模型进行分析得到输入文本对应的情感类型。对于情感强度参数可以通过如下几种方式获得:3)、将用户指定的情感强度参数值确定为目标情感强度参数,例如,在用户输入该输入文本时,指定该输入文本对应的情感强度参数;4)、根据用户给出的大致情感强度,如次轻度、轻度、适中、重度、超重度等,确定目标情感强度参数,具体的,可以预先设定次轻度、轻度、适中、重度、超重度分别对应的情感强度参数为0.1、0.2、0.5、0.7、0.9等。5)、使用默认值作为目标情感强度参数,例如,当没有任何指定时,使用默认情感强度参数值0.5作为“高兴”对应的目标情感强度参数。
102、语音合成装置获取与目标情感类型和目标情感强度参数对应的目标情感声学模型。
情感声学模型是指不同情感类型以及该情感类型的不同情感强度对应的声学模型,情感声学模型的数量为至少两个以上,具体数量取决于情感类型的种类多少以及每一种情感类型的情感强度等级数量,容易理解,一个情感强度参数可以对应一种情感强度等级。例如,以语音合成装置支持“高兴”、“悲伤”、“惊讶”、“恐惧”四种情感类型为例,如上述步骤101中类似将上述四种情感类型均划分为10个情感强度等级,此种情况下,语音合成装置中一共存在40个情感声学模型。
可选的,上述情感声学模型和中性声学模型对应的声学模型均可以基于隐马尔科夫模型或者深度神经网络模型进行构造得到的,当然,声学模型也可以基于其他具有类似功能的数学模型建模得到,对此本申请不做任何限制。
图2为本申请实施例中提供的一种深度神经网络模型示意图,如图2所示,本申请实施例中的声学模型可以采用深度神经网络如(bidirectional long short-term memory network,BLSTM)进行建模,其中,BLSTM是一种双向的时间递归神经网络,是一种在机器学习领域常用的循环神经网络模型。
可选的,语音合成装置获取与目标情感类型和目标情感强度参数对应的目标情感声学模型具体可以是:从情感声学模型集合中选取与目标情感类型和目标情感强度参数对应的情感声学模型作为目标情感声学模型,其中,情感声学模型集合中包括至少一个情感声学模型,至少一个情感声学模型中包括目标情感声学模型。
在从情感声学模型集合中选取与目标情感类型和目标情感强度参数对应的情感声学模型作为目标情感声学模型基础上,进一步可选的,语音合成装置还针对不同的情感类型和情感强度参数,利用中性声学模型和对应的情感语音数据进行模型训练得到情感声学模型集合,情感语音训练数据为具备一种或多种情感类型对应的情感的语音数据,中性声学模型是指使用中性语音训练数据进行模型训练得到的,该中性语音训练数据是指不具有任意一种情感类型对应的情感色彩的语音数据。
具体来说,情感声学模型可以是通过中性声学模型进行辅助,利用情感语音数据对情感声学模型进行自适应训练获得的。图3为本申请实施例中情感声学模型的一种自适应训练流程示意图,如图3所示,以目标情感类型为“高兴”,目标情感强度参数为0.6的情感声学模型训练为例,其中,虚线箭头指示的是一次性操作,即只执行一次,实线箭头指示多次迭代循环操作,即S31和S32只需执行一遍,而S33和S34则循环执行。
图3中所述的情感声学训练流程具体包括以下步骤:
S31、中性声学模型训练;
S32、初始化“高兴”情感声学模型。
S33、基于情感强度参数计算声学特征训练误差。
S34、根据声学特征训练误差更新情感声学模型。
其中,上S33和S34是循环执行的,其执行次数由整个情感声学模型训练的迭代次数和每次计算的数据样本批次大小决定。即若训练训练数据总样本数为N,每次执行的批次大小(batch)为32,则每次迭代过程需执行(N/32)次;而若整个训练过程在迭代T次后结束,则整个训练需要执行S33和S34的次数为(T*N/32)。另外,上述模型训练流程针对每种情感类型及其对应的每个情感强度分别进行模型训练,以得到不同情感类型以及情感类型的不同情感强度参数对应的情感声学模型。
在上述图3中,中性声学模型的训练可以是采用如图2所述的BLSTM模型进行训练,具体来说,中性声学模型的输入为文本特征,该文本特征具体可以是中性语音训练数据特征集,BLSTM模型的模型参数值使用随机值初始化,中性语音训练数据分批次(例如32个样本数据为一批)输入至BLSTM模型进行训练,调整BLSTM模型中各神经网络层对应的模型参数,使得声学特征训练误差不断减少。通过对中性语音训练数据进行多次迭代训练,直到迭代次数达到预置次数或者声学特征训练误差达到预定值时终止数据训练,将最终得到的模型参数作为目标中性模型输出。另外,情感声学模型的训练方式与上述所述的中性模型的训练方式一致,区别在于:在训练前,中性声学模型使用随机值进行初始化,而情感声学模型使用上述目标中性声学模型对应的模型参数进行初始化,情感声学模型的初始化操作具有可以是:将目标中性声学模型中各神经网络层的模型参数,依次赋值给情感声学模型中各神经网络层的对应参数。
需要说明的是,上述图3中所示出的只是一个声学模型的训练方法示意图,而不表征其具体实现方式。可选的,在具体实现上,中性声学模型中输入的可以是中性语音训练数据特征集,也可以是中性语音训练数据,其中,当输入中性语音训练数据时,在步骤S31的中性声学模型训练过程中还需要将中性语音训练数据进行文本特征提取以得到中性语音 训练数据特征集,以继续执行中性模型训练过程。类似的,情感声学模型训练的输入也既可以是情感语音训练数据,也可以是情感语音训练数据特征集。
可选的,情感语音训练数据对应的声学特征训练误差与情感强度参数相关,其中,情感语音训练数据对应的声学特征训练误差用于表征在声学模型训练过程中使用情感语音训练数据预测得到的声学特征与该情感语音训练数据的原始声学特征之间的声学特征损失,具体描述如上述步骤S33中。进一步可选的,情感语音训练数据对应的声学特征训练误差可以采用如下误差计算公式计算得到,该误差计算公式为:
loss=0.5×(y2-β*y1-(1-β)*y) 2,其中,loss为情感语音训练数据对应的声学特征训练误差,β为情感强度参数,y1为情感语音训练数据在中性声学模型下预测得到的声学特征参数,y2为情感语音训练数据在目标情感声学模型下预测得到的声学特征参数,y为情感语音训练数据的原始声学特征参数。需要说明的是,上述误差计算公式仅仅用于说明可以用于计算声学特征训练误差的一种可行的计算方式,与上述图3中所示的模型训练方法没有直接对应关系。具体来说,图3中所示的模型训练方法中声学特征训练误差的计算即可以采用上述误差计算公式,也可以采用其他误差计算公式,当采用上述误差计算公式时,还需要提前获取情感语音训练数据的原始声学特征参数y,至于其原始声学特征参数y的具体获取方式在本申请实施例中并不做任何限制。
在本申请实施例中,需要说明的是,声学特征训练误差的计算可以采用但不限于上面所述的误差计算公式进行计算,也可以采用其他类似的计算公式进行计算,对此本申请不做任何限制。
103、语音合成装置使用目标情感声学模型,将输入文本对应的文本特征转换为输入文本对应的声学特征。
其中,文本特征可以包括但不限于文本对应的音素、音节、词语或韵律短语中至少一项对应的特征,声学特征可以包括但不限于声音对应的基频特征、线谱对特征、清浊音标志特征或频谱包络特征中的至少一项。
可选的,在该语音合成方法中,还包括:语音合成装置对输入文本进行文本分析确定输入文本对应的文本特征。
具体来说,该文本分析可以包括但不限于文本规范化操作、分词操作、词性标注操作、语法分析操作、韵律预测操作、字音转换操作或时长信息分析操作中的至少一项操作。
其中,文本规范化操作是指将文本中的非汉字字符,如阿拉伯数字、英文符号、各种符号等转换为对应的汉字字符。
分词操作是指将文本中连续的汉语字串分割成由词组成的序列。
词性标注操作是指将文本中的名词、动词和形容词等标注出来。
语法分析操作是指分析文本中每个句子的语法和语义结构,确定语义中心,句子的重音位置与语调,从而为韵律预测操作提供重要信息。
韵律预测操作是指预测文本对应的每个句子中不同层级的韵律结构,如韵律词、韵律短语、语调短语等。
字音转换操作是指将汉字转换为拼音。
时长信息分析操作是指预测语音中音节、声韵母、音素、状态等的时长信息。
104、语音合成装置根据输入文本对应的声学特征生成目标情感语音。
利用声码器将输入文本对应的声学特征合成对应的目标情感语音,其中,目标情感语音的情感类型为上述目标情感类型,目标情感语音的情感强度值与上述目标情感强度参数的值相等。
其中,上述声码器可以包括但不限于STRAIGHT声码器或WORLD声码器,也可以是其他类型的声码器。本申请实施例中,通过大数据训练得到不同情感类型中不同情感强度参数对应的情感声学模型,并提取文本对应的情感类型和情感强度参数,依据情感类型和情感强度参数进行情感声学模型选择对应的目标情感声学模型,使用目标情感声学模型将文本特征转换为对应的声学特征,最终合成情感语音数据。由于情感声学模型是根据不同情感类型中不同情感强度参数进行数据训练得到的,因此,使用该情感声学模型获得的声学特征更加准确,可以合成不同情感类型且不同情感强度的语音,提升合成语音在情感表现方面的多样性。
进一步的,本申请实施例中语音合成方法使用基于情感强度参数的自适应学习技术,使得只需中性语音训练数据和少量的情感语音训练数据便能训练出不同情感强度的情感声学模型,由于情感语音训练数据的构建成本高于中性语音训练数据,因此,通过本申请实施例中的语音合成方法还可以减少情感语音训练数据的用量,从而降低训练数据构建成本。
The foregoing mainly describes the solutions provided in the embodiments of this application from the perspective of the speech synthesis apparatus. It can be understood that, to implement the foregoing functions, the speech synthesis apparatus includes hardware structures and/or software modules corresponding to each function. A person skilled in the art should readily appreciate that, in combination with the modules and algorithm steps of the examples described in the embodiments disclosed herein, this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
In terms of hardware structure, the speech synthesis apparatus in FIG. 1 may be implemented by one physical device, jointly by multiple physical devices, or as a logical functional module within one physical device; this is not specifically limited in the embodiments of this application.
For example, the speech synthesis apparatus in FIG. 1 may be implemented by the speech synthesis apparatus in FIG. 4. FIG. 4 is a schematic diagram of a hardware structure of the speech synthesis apparatus provided in an embodiment of this application.
As shown in FIG. 4, the speech synthesis apparatus 400 includes at least one processor 401, a communication line 402, a memory 403 and at least one communication interface 404.
The processor 401 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs of the solutions of this application.
The communication line 402 may include a path for transferring information between the foregoing components.
The communication interface 404 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN) or a wireless local area network (WLAN).
The memory 403 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, or may be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other compact disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the communication line 402, or may be integrated with the processor.
The memory 403 is configured to store computer-executable instructions for executing the solutions of this application, and the execution is controlled by the processor 401. The processor 401 is configured to execute the computer-executable instructions stored in the memory 403, thereby implementing the speech synthesis method provided in the embodiments of this application.
Optionally, the computer-executable instructions in the embodiments of this application may also be referred to as application program code; this is not specifically limited in the embodiments of this application.
In a specific implementation, as an embodiment, the processor 401 may include one or more CPUs, for example CPU0 and CPU1 in FIG. 4.
In a specific implementation, as an embodiment, the speech synthesis apparatus 400 may include multiple processors, for example the processor 401 and the processor 408 in FIG. 4. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits and/or processing cores for processing data (for example, computer program instructions).
In a specific implementation, as an embodiment, the speech synthesis apparatus 400 may further include an output device 405 and an input device 406. The output device 405 communicates with the processor 401 and may display information in multiple ways. For example, the output device 405 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 406 communicates with the processor 401 and may receive user input in multiple ways. For example, the input device 406 may be a mouse, a keyboard, a touchscreen device or a sensing device.
The speech synthesis apparatus 400 may be a general-purpose device or a dedicated device. In a specific implementation, the speech synthesis apparatus 400 may be a desktop computer, a portable computer, a network server, a personal digital assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, an embedded device, or a device with a structure similar to that in FIG. 4. The embodiments of this application do not limit the type of the speech synthesis apparatus 400.
In the embodiments of this application, the speech synthesis apparatus may be divided into functional modules according to the foregoing method examples. For example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division into modules in the embodiments of this application is illustrative and is merely a logical function division; there may be other division methods in actual implementation.
For example, in a case where the functional modules are divided in an integrated manner, FIG. 5 is a schematic structural diagram of an embodiment of the speech synthesis apparatus provided in an embodiment of this application.
As shown in FIG. 5, the speech synthesis apparatus 50 in this embodiment of this application includes a processing module 501.
The processing module 501 is configured to perform the following operations:
obtaining a target emotion type and a target emotion intensity parameter of input text, where the target emotion intensity parameter characterizes the emotion intensity corresponding to the target emotion type;
obtaining a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter;
inputting the text features of the input text into the target emotional acoustic model to obtain the acoustic features of the input text; and
synthesizing the target emotional speech according to the acoustic features of the input text.
Optionally, in an example, the processing module 501 is specifically configured to: determine the target emotion type according to an emotion label of the input text; and determine the target emotion intensity parameter according to an emotion intensity requirement corresponding to the input text.
Optionally, in an example, the processing module 501 is specifically configured to: select, from an emotional acoustic model set, the emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, where the emotional acoustic model set includes multiple emotional acoustic models, and the multiple emotional acoustic models include the target emotional acoustic model.
Optionally, in an example, the processing module 501 is further configured to: for different emotion types and different emotion intensity parameters of those emotion types, perform model training using the neutral acoustic model and the corresponding emotional speech training data to obtain the emotional acoustic model set, where the emotional acoustic model set includes an emotional acoustic model corresponding to each emotion intensity parameter of each emotion type, the emotional speech training data is data carrying the emotion of one or more emotion types, the neutral acoustic model is obtained through model training on neutral speech training data, and the neutral speech training data is data carrying no emotion of any emotion type.
Optionally, in an example, the acoustic feature training error corresponding to the emotional speech training data is related to the emotion intensity parameter, and the acoustic feature training error corresponding to the emotional speech training data characterizes the acoustic feature loss between the acoustic features predicted from the emotional speech training data during acoustic model training and the original acoustic features of the emotional speech training data.
Optionally, in an example, the acoustic feature training error corresponding to the emotional speech training data is computed with an error formula, the error formula being:
loss = 0.5 × (y2 - β*y1 - (1 - β)*y)², where loss is the acoustic feature training error corresponding to the emotional speech training data, β is the emotion intensity parameter, y1 is the acoustic feature parameter predicted from the emotional speech training data under the neutral acoustic model, y2 is the acoustic feature parameter predicted from the emotional speech training data under the target emotional acoustic model, and y is the original acoustic feature parameter of the emotional speech training data.
Optionally, in an example, both the neutral acoustic model and the emotional acoustic models may be built on a hidden Markov model or a deep neural network model.
Optionally, in an example, the text features include features corresponding to at least one of the phonemes, syllables, words or prosodic phrases of the text, and the acoustic features include at least one of the fundamental frequency feature, line spectral pair feature, voiced/unvoiced flag feature or spectral envelope feature of the speech.
Optionally, in an example, the speech synthesis apparatus 50 may further include an input module 502 and an output module 503, where the input module 502 may be configured to input the foregoing input text into the speech synthesis apparatus 50, and the output module 503 may be configured to output the finally synthesized target emotional speech.
All related content of the steps in the foregoing method embodiments may be cited in the functional descriptions of the corresponding functional modules, and details are not repeated here.
In this embodiment, the speech synthesis apparatus 50 is presented with its functional modules divided in an integrated manner. The "module" here may refer to an application-specific integrated circuit (ASIC), a circuit, a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or another device that can provide the foregoing functions. In a simple embodiment, a person skilled in the art may figure out that the speech synthesis apparatus 50 may take the form shown in FIG. 4.
For example, the processor 401 in FIG. 4 may invoke the computer-executable instructions stored in the memory 403, so that the speech synthesis apparatus 50 performs the speech synthesis method in the foregoing method embodiments.
Specifically, the functions/implementation processes of the processing module 501, the input module 502 and the output module 503 in FIG. 5 may be implemented by the processor 401 in FIG. 4 invoking the computer-executable instructions stored in the memory 403. Alternatively, the functions/implementation processes of the processing module 501 in FIG. 5 may be implemented by the processor 401 in FIG. 4 invoking the computer-executable instructions stored in the memory 403, and the functions/implementation processes of the input module 502 and the output module 503 in FIG. 5 may be implemented through the communication interface 404 in FIG. 4.
Because the speech synthesis apparatus provided in the embodiments of this application can be used to perform the foregoing speech synthesis method, the technical effects it can achieve can be found in the foregoing method embodiments and are not repeated here.
In the foregoing embodiment, the speech synthesis apparatus is presented with its functional modules divided in an integrated manner. Of course, in the embodiments of this application, the functional modules may alternatively be divided such that each module corresponds to one function; this is not specifically limited in the embodiments of this application.
Optionally, an embodiment of this application provides a chip system. The chip system includes a processor configured to support the speech synthesis apparatus in implementing the foregoing speech synthesis method. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the speech synthesis apparatus. The chip system may consist of a chip, or may include a chip and other discrete components; this is not specifically limited in the embodiments of this application.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When software is used, the implementation may be wholly or partly in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are wholly or partly generated. The computer may be a general-purpose computer, a dedicated computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center in a wired manner (for example, coaxial cable, optical fiber or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio or microwave). The computer-readable storage medium may be any usable medium accessible by the computer, or a data storage device, such as a server or data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
A person of ordinary skill in the art may understand that all or some of the steps of the methods in the foregoing embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include a ROM, a RAM, a magnetic disk, an optical disc, or the like.
The speech synthesis method and speech synthesis apparatus provided in the embodiments of this application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of this application, and the descriptions of the foregoing embodiments are only intended to help understand the method of this application and its core ideas. Meanwhile, a person of ordinary skill in the art may make changes to the specific implementations and the application scope according to the ideas of this application. In conclusion, the content of this specification should not be construed as a limitation on this application.

Claims (16)

  1. A speech synthesis method, comprising:
    obtaining a target emotion type and a target emotion intensity parameter of input text, wherein the target emotion intensity parameter characterizes an emotion intensity corresponding to the target emotion type;
    obtaining a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter;
    inputting text features of the input text into the target emotional acoustic model to obtain acoustic features of the input text; and
    synthesizing target emotional speech according to the acoustic features of the input text.
  2. The speech synthesis method according to claim 1, wherein the obtaining a target emotion type and a target emotion intensity parameter of input text comprises:
    determining the target emotion type according to an emotion label of the input text; and
    determining the target emotion intensity parameter according to an emotion intensity requirement corresponding to the input text.
  3. The speech synthesis method according to claim 1 or 2, wherein the obtaining a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter comprises:
    selecting, from an emotional acoustic model set, an emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, wherein the emotional acoustic model set comprises a plurality of emotional acoustic models, and the plurality of emotional acoustic models comprise the target emotional acoustic model.
  4. The speech synthesis method according to claim 3, wherein before the selecting, from the emotional acoustic model set, the target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter, the method further comprises:
    for different emotion types and different emotion intensity parameters of the emotion types, performing model training using a neutral acoustic model and corresponding emotional speech training data to obtain the emotional acoustic model set, wherein the emotional speech training data is data carrying an emotion corresponding to one or more emotion types, the neutral acoustic model is obtained through model training using neutral speech training data, and the neutral speech training data is data carrying no emotion corresponding to any emotion type.
  5. The speech synthesis method according to claim 4, wherein an acoustic feature training error corresponding to the emotional speech training data is related to the emotion intensity parameter, and the acoustic feature training error corresponding to the emotional speech training data characterizes an acoustic feature loss between acoustic features predicted from the emotional speech training data during model training and original acoustic features of the emotional speech training data.
  6. The speech synthesis method according to claim 5, wherein the acoustic feature training error corresponding to the emotional speech training data is computed with an error formula, the error formula being:
    loss = 0.5 × (y2 - β*y1 - (1 - β)*y)²,
    wherein loss is the acoustic feature training error corresponding to the emotional speech training data, β is the emotion intensity parameter, y1 is an acoustic feature parameter predicted from the emotional speech training data under the neutral acoustic model, y2 is an acoustic feature parameter predicted from the emotional speech training data under the target emotional acoustic model, and y is an original acoustic feature parameter of the emotional speech training data.
  7. The speech synthesis method according to any one of claims 4 to 6, wherein the neutral acoustic model and the emotional acoustic models are both built on a hidden Markov model or a deep neural network model.
  8. A speech synthesis apparatus, comprising:
    a processing module, configured to: obtain a target emotion type and a target emotion intensity parameter of input text, wherein the target emotion intensity parameter characterizes an emotion intensity corresponding to the target emotion type; obtain a target emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter; input text features of the input text into the target emotional acoustic model to obtain acoustic features of the input text; and synthesize target emotional speech according to the acoustic features of the input text.
  9. The speech synthesis apparatus according to claim 8, wherein the processing module is specifically configured to:
    determine the target emotion type according to an emotion label of the input text; and
    determine the target emotion intensity parameter according to an emotion intensity requirement corresponding to the input text.
  10. The speech synthesis apparatus according to claim 8 or 9, wherein the processing module is specifically configured to:
    select, from an emotional acoustic model set, an emotional acoustic model corresponding to the target emotion type and the target emotion intensity parameter as the target emotional acoustic model, wherein the emotional acoustic model set comprises a plurality of emotional acoustic models, and the plurality of emotional acoustic models comprise the target emotional acoustic model.
  11. The speech synthesis apparatus according to claim 10, wherein the processing module is further configured to:
    for different emotion types and different emotion intensity parameters of the emotion types, perform model training using a neutral acoustic model and corresponding emotional speech training data to obtain the emotional acoustic model set, wherein the emotional speech training data is data carrying an emotion corresponding to one or more emotion types, the neutral acoustic model is obtained through model training using neutral speech training data, and the neutral speech training data is data carrying no emotion corresponding to any emotion type.
  12. The speech synthesis apparatus according to claim 11, wherein an acoustic feature training error corresponding to the emotional speech training data is related to the emotion intensity parameter, and the acoustic feature training error corresponding to the emotional speech training data characterizes an acoustic feature loss between acoustic features predicted from the emotional speech training data during model training and original acoustic features of the emotional speech training data.
  13. The speech synthesis apparatus according to claim 12, wherein the acoustic feature training error corresponding to the emotional speech training data is computed with an error formula, the error formula being:
    loss = 0.5 × (y2 - β*y1 - (1 - β)*y)²,
    wherein loss is the acoustic feature training error corresponding to the emotional speech training data, β is the emotion intensity parameter, y1 is an acoustic feature parameter predicted from the emotional speech training data under the neutral acoustic model, y2 is an acoustic feature parameter predicted from the emotional speech training data under the target emotional acoustic model, and y is an original acoustic feature parameter of the emotional speech training data.
  14. The speech synthesis apparatus according to any one of claims 11 to 13, wherein the neutral acoustic model and the emotional acoustic models are both built on a hidden Markov model or a deep neural network model.
  15. A speech synthesis apparatus, comprising:
    a processing unit and a storage unit,
    wherein the storage unit is configured to store computer operation instructions; and
    the processing unit is configured to invoke the computer operation instructions to perform the speech synthesis method according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the computer-readable storage medium comprises computer operation instructions, and when the computer operation instructions are run on a computer, the computer is caused to perform the speech synthesis method according to any one of claims 1 to 7.
PCT/CN2019/091844 2018-11-15 2019-06-19 一种语音合成方法及语音合成装置 WO2020098269A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/944,863 US11282498B2 (en) 2018-11-15 2020-07-31 Speech synthesis method and speech synthesis apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811360232.2 2018-11-15
CN201811360232.2A CN111192568B (zh) 2018-11-15 2018-11-15 一种语音合成方法及语音合成装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/944,863 Continuation US11282498B2 (en) 2018-11-15 2020-07-31 Speech synthesis method and speech synthesis apparatus

Publications (1)

Publication Number Publication Date
WO2020098269A1 true WO2020098269A1 (zh) 2020-05-22

Family

ID=70707121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/091844 WO2020098269A1 (zh) 2018-11-15 2019-06-19 一种语音合成方法及语音合成装置

Country Status (3)

Country Link
US (1) US11282498B2 (zh)
CN (1) CN111192568B (zh)
WO (1) WO2020098269A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951200A (zh) * 2021-01-28 2021-06-11 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111192568B (zh) * 2018-11-15 2022-12-13 华为技术有限公司 一种语音合成方法及语音合成装置
US11854538B1 (en) * 2019-02-15 2023-12-26 Amazon Technologies, Inc. Sentiment detection in audio data
JP7405660B2 (ja) * 2020-03-19 2023-12-26 Lineヤフー株式会社 出力装置、出力方法及び出力プログラム
CN112349272A (zh) * 2020-10-15 2021-02-09 北京捷通华声科技股份有限公司 语音合成方法、装置、存储介质及电子装置
CN112489621B (zh) * 2020-11-20 2022-07-12 北京有竹居网络技术有限公司 语音合成方法、装置、可读介质及电子设备
CN112786007B (zh) * 2021-01-20 2024-01-26 北京有竹居网络技术有限公司 语音合成方法、装置、可读介质及电子设备
CN113096640A (zh) * 2021-03-08 2021-07-09 北京达佳互联信息技术有限公司 一种语音合成方法、装置、电子设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661569A (zh) * 2009-09-18 2010-03-03 北京科技大学 一种智能情感机器人多模态行为关联表达系统
US20120078607A1 (en) * 2010-09-29 2012-03-29 Kabushiki Kaisha Toshiba Speech translation apparatus, method and program
US20160071510A1 (en) * 2014-09-08 2016-03-10 Microsoft Corporation Voice generation with predetermined emotion type
CN106531150A (zh) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 一种基于深度神经网络模型的情感合成方法
CN106653000A (zh) * 2016-11-16 2017-05-10 太原理工大学 一种基于语音信息的情感强度实验方法

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04199098A (ja) * 1990-11-29 1992-07-20 Meidensha Corp 規則音声合成装置
US7457752B2 (en) * 2001-08-14 2008-11-25 Sony France S.A. Method and apparatus for controlling the operation of an emotion synthesizing device
US7401020B2 (en) * 2002-11-29 2008-07-15 International Business Machines Corporation Application of emotion-based intonation and prosody to speech in text-to-speech systems
JP2003233388A (ja) * 2002-02-07 2003-08-22 Sharp Corp 音声合成装置および音声合成方法、並びに、プログラム記録媒体
DE60215296T2 (de) * 2002-03-15 2007-04-05 Sony France S.A. Verfahren und Vorrichtung zum Sprachsyntheseprogramm, Aufzeichnungsmedium, Verfahren und Vorrichtung zur Erzeugung einer Zwangsinformation und Robotereinrichtung
JP4456537B2 (ja) * 2004-09-14 2010-04-28 本田技研工業株式会社 情報伝達装置
WO2006123539A1 (ja) * 2005-05-18 2006-11-23 Matsushita Electric Industrial Co., Ltd. 音声合成装置
US7983910B2 (en) * 2006-03-03 2011-07-19 International Business Machines Corporation Communicating across voice and text channels with emotion preservation
CN101064104B (zh) * 2006-04-24 2011-02-02 中国科学院自动化研究所 基于语音转换的情感语音生成方法
CN102005205B (zh) * 2009-09-03 2012-10-03 株式会社东芝 情感语音合成方法和装置
WO2012003602A1 (zh) * 2010-07-09 2012-01-12 西安交通大学 一种电子喉语音重建方法及其系统
CN102385858B (zh) 2010-08-31 2013-06-05 国际商业机器公司 情感语音合成方法和系统
GB2517503B (en) * 2013-08-23 2016-12-28 Toshiba Res Europe Ltd A speech processing system and method
KR102222122B1 (ko) * 2014-01-21 2021-03-03 엘지전자 주식회사 감성음성 합성장치, 감성음성 합성장치의 동작방법, 및 이를 포함하는 이동 단말기
US9824681B2 (en) * 2014-09-11 2017-11-21 Microsoft Technology Licensing, Llc Text-to-speech with emotional content
CN105991847B (zh) * 2015-02-16 2020-11-20 北京三星通信技术研究有限公司 通话方法和电子设备
CN106570496B (zh) * 2016-11-22 2019-10-01 上海智臻智能网络科技股份有限公司 情绪识别方法和装置以及智能交互方法和设备
CN107103900B (zh) * 2017-06-06 2020-03-31 西北师范大学 一种跨语言情感语音合成方法及系统
US10418025B2 (en) * 2017-12-06 2019-09-17 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
CN111192568B (zh) * 2018-11-15 2022-12-13 华为技术有限公司 一种语音合成方法及语音合成装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101661569A (zh) * 2009-09-18 2010-03-03 北京科技大学 一种智能情感机器人多模态行为关联表达系统
US20120078607A1 (en) * 2010-09-29 2012-03-29 Kabushiki Kaisha Toshiba Speech translation apparatus, method and program
US20160071510A1 (en) * 2014-09-08 2016-03-10 Microsoft Corporation Voice generation with predetermined emotion type
CN106653000A (zh) * 2016-11-16 2017-05-10 太原理工大学 一种基于语音信息的情感强度实验方法
CN106531150A (zh) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 一种基于深度神经网络模型的情感合成方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112951200A (zh) * 2021-01-28 2021-06-11 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质
CN112951200B (zh) * 2021-01-28 2024-03-12 北京达佳互联信息技术有限公司 语音合成模型的训练方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN111192568A (zh) 2020-05-22
CN111192568B (zh) 2022-12-13
US20200357383A1 (en) 2020-11-12
US11282498B2 (en) 2022-03-22

Similar Documents

Publication Publication Date Title
WO2020098269A1 (zh) 一种语音合成方法及语音合成装置
JP7280386B2 (ja) 多言語音声合成およびクロスランゲージボイスクローニング
CN108597492B (zh) 语音合成方法和装置
JP6802005B2 (ja) 音声認識装置、音声認識方法及び音声認識システム
US11450313B2 (en) Determining phonetic relationships
WO2020073944A1 (zh) 语音合成方法及设备
EP3282368A1 (en) Parallel processing-based translation method and apparatus
CN112309366B (zh) 语音合成方法、装置、存储介质及电子设备
US11881210B2 (en) Speech synthesis prosody using a BERT model
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
JP2021196598A (ja) モデルトレーニング方法、音声合成方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN111354343B (zh) 语音唤醒模型的生成方法、装置和电子设备
KR102619408B1 (ko) 음성 합성 방법, 장치, 전자 기기 및 저장 매체
CN112309367B (zh) 语音合成方法、装置、存储介质及电子设备
CN110852075B (zh) 自动添加标点符号的语音转写方法、装置及可读存储介质
WO2023045186A1 (zh) 意图识别方法、装置、电子设备和存储介质
CN105895076B (zh) 一种语音合成方法及系统
TW201937479A (zh) 一種多語言混合語音識別方法
CN114373445B (zh) 语音生成方法、装置、电子设备及存储介质
Dandge et al. Multilingual Global Translation using Machine Learning
CN115392189B (zh) 多语种混合语料的生成方法及装置、训练方法及装置
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
JP2023006055A (ja) プログラム、情報処理装置、方法
CN117153142A (zh) 一种语音信号合成方法、装置、电子设备及存储介质
Vivancos-Vicente et al. IXHEALTH: A Multilingual Platform for Advanced Speech Recognition in Healthcare

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19884865

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19884865

Country of ref document: EP

Kind code of ref document: A1