WO2021169825A1 - Speech synthesis method, device, equipment and storage medium - Google Patents

Speech synthesis method, device, equipment and storage medium

Info

Publication number
WO2021169825A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
speech synthesis
target user
acoustic
text content
Prior art date
Application number
PCT/CN2021/076683
Other languages
English (en)
French (fr)
Inventor
黄智颖
雷鸣
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2021169825A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces

Definitions

  • the present invention relates to the field of artificial intelligence technology, in particular to a speech synthesis method, device, equipment and storage medium.
  • in response to a user's spoken question, a question answering robot can output a spoken response to the user.
  • the response voices output by question answering robots often share uniform acoustic characteristics, which makes the interaction poor.
  • the embodiments of the present invention provide a speech synthesis method, device, equipment and storage medium, which can realize the purpose of personalized speech interaction.
  • an embodiment of the present invention provides a speech synthesis method, which includes:
  • a voice signal corresponding to the text content is generated to output the voice signal.
  • an embodiment of the present invention provides a speech synthesis device, which includes:
  • the first obtaining module is configured to obtain text content corresponding to the interactive behavior and identification information of the target user in response to the interactive behavior triggered by the user;
  • the determining module is used to determine the linguistic characteristics corresponding to the text content
  • the second acquisition module is configured to input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain the acoustic characteristics of the target user corresponding to the text content through the speech synthesis model;
  • the generating module is configured to generate a voice signal corresponding to the text content according to the acoustic feature to output the voice signal.
  • an embodiment of the present invention provides an electronic device, including: a memory and a processor; wherein executable code is stored in the memory, and when the executable code is executed by the processor, the processor can at least implement the speech synthesis method as described in the first aspect.
  • the embodiment of the present invention provides a non-transitory machine-readable storage medium having executable code stored thereon, and when the executable code is executed by a processor of an electronic device, the processor can at least implement the speech synthesis method as described in the first aspect.
  • when it is desired to output a voice signal corresponding to a certain text content to a certain user (such as user A) in the voice of the target user (such as user B), the linguistic features corresponding to the text content are determined first, and then the linguistic features and the identification information of the target user are input into the speech synthesis model, so as to obtain the acoustic features of the target user corresponding to the text content through the speech synthesis model.
  • the speech synthesis model has learned the acoustic characteristics of the target user.
  • the voice synthesis model outputs a voice signal corresponding to the text content through a vocoder based on the acoustic characteristics.
  • Fig. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a speech synthesis process using a speech synthesis model according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of a usage scenario of a speech synthesis method provided by an embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of the first training stage of a speech synthesis model provided by an embodiment of the present invention
  • FIG. 5 is a schematic diagram of the training principle of the first training stage of a speech synthesis model provided by an embodiment of the present invention
  • FIG. 6 is a schematic flowchart of the second training stage of a speech synthesis model provided by an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of the training principle of the second training stage of a speech synthesis model provided by an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present invention.
  • FIG. 9 is a schematic structural diagram of an electronic device corresponding to the speech synthesis device provided in the embodiment shown in FIG. 8.
  • the word “if” as used herein can be interpreted as “when”, “in response to determination”, or “in response to detection”.
  • the phrase “if determined” or “if detected (statement or event)” can be interpreted as “when determined”, “in response to determination”, “when detected (statement or event)”, or “in response to detection (statement or event)”.
  • the speech synthesis method provided by the embodiment of the present invention may be executed by an electronic device, and the electronic device may be a terminal device such as a PC, a notebook computer, a smart phone, a smart robot, etc., or a server.
  • the server may be a physical server containing an independent host, or it may be a virtual server, or it may be a cloud server or a server cluster.
  • the speech synthesis method provided by the embodiments of the present invention can be applied to any scene where a voice signal needs to be output to a user, such as a scene where a user uses an intelligent robot to conduct a human-machine conversation, or a voice interaction scene where a user uses a voice assistant.
  • the above-mentioned electronic equipment may have one or more application programs supporting voice interaction functions for the majority of users to use.
  • Fig. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
  • the purpose of the speech synthesis method provided by the embodiment of the present invention is to output a certain text content as the voice of a specific user (that is, the aforementioned target user).
  • the user-triggered interaction behavior described in step 101 can be understood as the behavior of the user inputting voice instructions to the APP or smart device during the process of using the APP or smart device that supports the voice interaction function.
  • the foregoing text content may be the text content that a terminal device, such as a smart robot, determines should be output to user A based on the interactive behavior triggered by user A.
  • the text content needs to be output to user A in the voice of user B.
  • for example, user A says "What is the weather in Beijing tomorrow?". Assume that, based on voice recognition and semantic understanding, the text content of the response is determined as follows: tomorrow, Beijing will be sunny, the temperature will be between -5°C and 3°C, and the northeast wind will be level 1. Then, the text content will be output in user B's voice.
  • user A is an ordinary user who uses the application.
  • user A can customize the target user he needs, such as user B, so that voice signals output to user A are spoken in the voice of user B.
  • the application program may also be configured with a certain target user by default, such as user C, so that the application program can achieve the effect of using the voice of user C to perform voice interaction with all users who use the application program.
  • the application can display a list of target users through the interface, and user A can select the target users he needs.
  • the speech synthesis model provided by the embodiment of the present invention has learned the acoustic characteristics of each target user in the target user list. Specifically, it has learned the mapping relationship between the acoustic characteristics of each target user and the linguistic characteristics of any text content; the specific implementation process will be described in subsequent embodiments.
  • the identification information of the target user may be identification information such as user B's name and serial number.
  • the aforementioned speech synthesis model may include a front-end module, the front-end module is used to annotate the linguistic features of the text content, and the annotation process of the linguistic features can be implemented with reference to existing related technologies.
  • the linguistic features that can be marked include, but are not limited to: the pronunciation and tone of each word, the position and part of speech of each word in the text content, and the prosody, stress, and rhythm of the text content.
  • the linguistic features corresponding to the text content and the identification information of the user B are input into the speech synthesis model, so as to obtain the acoustic features of the user B corresponding to the text content through the speech synthesis model.
  • the speech synthesis model includes a first encoder and a decoder, and obtaining the acoustic features of user B corresponding to the text content through the speech synthesis model is specifically implemented as follows: the linguistic features are encoded by the first encoder to obtain the first encoding vector C1 corresponding to the linguistic features; the second encoding vector C2 corresponding to the identification information of user B is determined; the first encoding vector C1 and the second encoding vector C2 are spliced to obtain the encoding vector C3; and finally the spliced encoding vector C3 is decoded by the decoder to obtain the acoustic features of user B corresponding to the above-mentioned text content.
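  • the encode, splice, and decode flow described above can be sketched as follows. This is only an illustrative sketch: `toy_first_encoder`, `toy_decoder`, `SPEAKER_VECTORS`, and all numeric values are hypothetical stand-ins, since the text leaves the concrete networks open (RNN or LSTM models are mentioned as options), and splicing is assumed here to be plain concatenation with C1 placed before C2.

```python
from typing import Dict, List

Vector = List[float]


def toy_first_encoder(linguistic_features: Vector) -> Vector:
    """Hypothetical stand-in for the first encoder: produces C1."""
    return [0.5 * x for x in linguistic_features]


def toy_decoder(spliced: Vector) -> Vector:
    """Hypothetical stand-in for the decoder: maps C3 to a toy acoustic feature."""
    weighted = sum((i + 1) * v for i, v in enumerate(spliced))
    return [weighted, sum(spliced)]


def predict_acoustic_features(
    linguistic_features: Vector,
    target_user_id: str,
    speaker_vectors: Dict[str, Vector],
) -> Vector:
    c1 = toy_first_encoder(linguistic_features)  # first encoding vector C1
    c2 = speaker_vectors[target_user_id]         # second encoding vector C2
    c3 = c1 + c2                                 # splice (assumed: concatenation)
    return toy_decoder(c3)                       # acoustic features of the target user


# Assumed lookup table from target-user identification to encoding vector.
SPEAKER_VECTORS = {"user_B": [1.0, 0.0], "user_C": [0.0, 1.0]}

features_b = predict_acoustic_features([2.0, 4.0], "user_B", SPEAKER_VECTORS)
features_c = predict_acoustic_features([2.0, 4.0], "user_C", SPEAKER_VECTORS)
```

  • with the same linguistic features, changing only the target user's encoding vector changes the decoder output, which is the property the method relies on.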
  • the acoustic feature can be a feature that reflects the acoustic characteristics of a person’s speech speed, timbre, etc.
  • the acoustic feature can be Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), short-term average energy, average rate of change of amplitude, etc.
  • the first encoder and decoder may be implemented as a neural network model such as a Recurrent Neural Network (RNN) model, a Long Short Term Memory (LSTM) model, etc.
  • a vocoder can be used to generate, according to the acoustic features, a voice signal corresponding to user B and the text content, that is, a voice signal in which the text content is spoken with the acoustic characteristics of user B.
  • the APP determines the target text content that needs to be answered and, based on user A's selection of target user B, the APP can obtain and output the voice signal of user B speaking the target text content.
  • the speech synthesis solution provided by the embodiment shown in FIG. 1 can be executed.
  • the core of realizing speech signal synthesis is to train a speech synthesis model that can learn the acoustic characteristics of different users under different linguistic characteristics, and the speech synthesis model has a low training cost and high accuracy. Based on the speech synthesis model, the speech synthesis task of outputting a speech signal in the voice of a specific user can be efficiently completed.
  • Fig. 3 is a schematic diagram of a usage scenario of a speech synthesis method provided by an embodiment of the present invention.
  • a certain application program (APP) supporting voice interaction function is installed in a mobile phone of a user A, such as a common voice assistant application.
  • the user A has performed the following configuration operations on the APP in advance: simulating the voice interaction between the user B and himself.
  • the voice synthesis model has learned the acoustic features of user B corresponding to various linguistic features by collecting voice signal samples of user B.
  • the APP will first input the reply content into the front-end module of the speech synthesis model to obtain, through the front-end module, the linguistic feature T corresponding to the reply content. Furthermore, the linguistic feature T is input to the first encoder in the speech synthesis model to obtain the coding vector Ca, the coding vector Cb corresponding to the identification information of the user B is determined, and the coding vector Ca and the coding vector Cb are spliced to obtain the coding vector Cc.
  • the code vector Cc is input into the decoder to obtain the acoustic feature S corresponding to the reply content of the user B, and the acoustic feature S is input to the vocoder, so as to finally obtain the voice signal W output by the vocoder.
  • the waveform of the speech signal W is shown in Figure 3.
  • the speech synthesis solution provided in this article can not only be applied to the application scenario shown in FIG. 3, but also applicable to other scenarios where voice interaction with the user is performed, such as video dubbing scenarios, live broadcast scenarios, and so on.
  • the speech synthesis model can learn the corresponding acoustic characteristics of user Y under various linguistic features.
  • the acoustic features of user Y corresponding to the linguistic features of the text content can be predicted, and a voice signal of the same line spoken by user Y can be synthesized according to the predicted acoustic features, to achieve the effect of dubbing the character Z with the voice of user Y.
  • the effect of a live broadcaster with multiple different voices can be realized.
  • the anchor can configure the correspondence between multiple target users and multiple products, that is, configure which product is recommended in which target user’s voice.
  • the result of the anchor configuration is: recommend the product S with the voice of the user C, recommend the product T with the voice of the user D, and recommend the product R with the anchor's own voice.
  • the audio and video collection device on the anchor side collects the audio and video data of the three commodities live broadcast by the anchor and uploads it to the server.
  • the server can intercept the audio and video clips corresponding to each commodity from the uploaded audio and video data.
  • the audio and video clips of the product R recommended by the anchor can be considered to be directly provided to the viewer without modification.
  • the audio clips of the product S and product T recommended by the anchor will be processed by voice recognition (ASR) to obtain the corresponding text content; then, by the voice synthesis method provided in the foregoing embodiment, the text content corresponding to the product S is synthesized into the voice signal of the product S recommended in the voice of the user C, and the text content corresponding to the product T is synthesized into the voice signal of the product T recommended in the voice of the user D.
  • the process of speech synthesis can be referred to the description in the foregoing embodiment, which will not be repeated here.
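  • the per-product handling in the live broadcast scenario can be sketched as a small routing table. This is a minimal sketch under assumptions: the names `VOICE_BY_PRODUCT` and `plan_clip` and the action labels are hypothetical, and `None` is used here to mean that the clip is passed through unmodified, as described for product R.

```python
from typing import Dict, Optional, Tuple

# Anchor configuration from the example: product -> target user whose voice is used.
# None (an assumed convention) means the clip is provided to viewers unmodified.
VOICE_BY_PRODUCT: Dict[str, Optional[str]] = {
    "S": "user_C",
    "T": "user_D",
    "R": None,
}


def plan_clip(product: str) -> Tuple[str, Optional[str]]:
    """Return the processing plan for the audio clip of one product."""
    target_user = VOICE_BY_PRODUCT.get(product)
    if target_user is None:
        return ("pass_through", None)
    # ASR recovers the text content, which is then re-synthesized in the target voice.
    return ("asr_then_synthesize", target_user)


plans = {product: plan_clip(product) for product in ("S", "T", "R")}
```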
  • the front-end module, first encoder, and decoder in the above-mentioned speech synthesis model are the parts that are finally used for speech synthesis of certain text content.
  • the speech synthesis model further includes a second encoder, wherein the second encoder and the first encoder share a decoder.
  • the process of training the speech synthesis model including the first encoder and the second encoder includes two stages of training, which are called the first training stage and the second training stage, respectively.
  • any training sample pair corresponding to any user is composed of a voice signal and text content corresponding to the voice signal.
  • the multiple users do not include the target user mentioned in the foregoing embodiment. The speech synthesis model is trained through the multiple training sample pairs corresponding to the multiple users to complete the training task of the first training stage.
  • any training sample pair of user D includes a voice signal D1 and text content D2, where the voice signal D1 is the voice of the text content D2 spoken by the user D.
  • a large number of text contents can be preset, so that different users can read all or part of the text content, and the user can record them during the reading process to obtain the above-mentioned voice signal as a pair of training samples.
  • the first training stage of the speech synthesis model may include the following steps:
  • the fourth code vector Z0 and the fifth code vector Z1 are spliced to obtain a first splicing result P1
  • the fourth code vector Z0 and the sixth code vector Z2 are spliced to obtain a second splicing result P2.
  • the function value of the loss function is determined according to the acoustic characteristic output by the decoder and the acoustic characteristic corresponding to the above-mentioned speech signal D1 as the supervision information, and the parameter adjustment of the first encoder, the second encoder and the decoder in the model is performed.
  • the phonetic posterior probability feature (Phonetic Posterior Grams, referred to as PPGs) is a time-by-category matrix that represents the posterior probability of each pronunciation category y at each time frame t of an audio segment; in other words, it represents the probability distribution over pronunciation categories for each of the multiple frames contained in a speech signal.
  • the pronunciation category refers to the smallest pronunciation unit, such as a phoneme.
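  • to make the shape of the PPG matrix concrete, the sketch below turns per-frame scores into one probability distribution per frame. It is only an illustration: the phoneme inventory, the scores, and the use of a softmax to normalize them are assumptions, since the text does not specify how the acoustic model produces the posteriors.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Normalize one frame's scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phonetic_posteriorgram(frame_scores: List[List[float]]) -> List[List[float]]:
    """Rows are time frames, columns are pronunciation categories."""
    return [softmax(row) for row in frame_scores]


PHONEMES = ["a", "b", "sil"]  # hypothetical pronunciation categories
scores = [
    [2.0, 0.1, 0.1],  # frame 0
    [0.2, 3.0, 0.1],  # frame 1
    [0.0, 0.0, 4.0],  # frame 2
]
ppg = phonetic_posteriorgram(scores)
most_likely = [PHONEMES[row.index(max(row))] for row in ppg]
```

  • each row of `ppg` sums to 1, and `most_likely` lists the highest-probability category for each frame.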
  • the composition of the speech synthesis model and the training process of the first training stage are illustrated below in conjunction with FIG. 5.
  • the input terminal of the first encoder may be connected with the front-end module described above, and the input terminal of the second encoder may be connected with an acoustic model.
  • the text content D2 is input to the front-end module, and the linguistic features corresponding to the text content D2 can be output through the front-end module.
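  • the process of obtaining the phoneme posterior probability feature corresponding to the speech signal D1 can be implemented as follows: the speech signal D1 is framed to obtain multiple frames of speech signals; the acoustic features corresponding to each of the multiple frames of speech signals are extracted; and these acoustic features are input into the acoustic model to predict the phoneme posterior probability feature corresponding to the speech signal D1, wherein the acoustic features corresponding to each of the multiple frames of speech signals are used as the supervision information.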
  • the linguistic feature is coded by the first encoder to obtain a fifth code vector Z1 corresponding to the linguistic feature.
  • the phoneme posterior probability feature is coded by the second encoder to obtain a sixth code vector Z2 corresponding to the phoneme posterior probability feature.
  • assuming that N is the number of users collected for the first training stage, an N-dimensional vector can be generated as the encoding vector corresponding to the identification information of each user.
  • the coding vector corresponding to user D is Z0.
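  • one simple way to obtain an N-dimensional encoding vector per user is a one-hot vector, sketched below. The one-hot choice is an assumption made for illustration; the text only states that the vector is N-dimensional. The user names are hypothetical.

```python
from typing import Dict, List


def build_speaker_vectors(user_ids: List[str]) -> Dict[str, List[float]]:
    """Assign each of the N first-stage users an N-dimensional one-hot vector."""
    n = len(user_ids)
    return {
        user_id: [1.0 if column == row else 0.0 for column in range(n)]
        for row, user_id in enumerate(user_ids)
    }


first_stage_users = ["user_D", "user_E", "user_F"]  # hypothetical, N = 3
speaker_vectors = build_speaker_vectors(first_stage_users)
z0 = speaker_vectors["user_D"]  # encoding vector for user D
```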
  • the fourth code vector Z0 and the fifth code vector Z1 are spliced to obtain the first splicing result P1, and the fourth code vector Z0 and the sixth code vector Z2 are spliced to obtain the second splicing result P2.
  • a switch is provided at the input of the decoder. By randomly flipping the switch, it is possible to control whether the first splicing result P1 or the second splicing result P2 is input to the decoder.
  • the output of the decoder corresponding to the first splicing result P1 may be different from the output corresponding to the second splicing result P2.
  • the decoder can finally learn the mapping relationship between phoneme posterior probability features and acoustic features and the mapping relationship between linguistic features and acoustic features through training in the first training stage of a large number of training sample pairs.
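  • the random switch at the decoder input can be sketched as follows. This is a minimal sketch: the equal probability for the two branches, the concatenation order, and the vector values are assumptions, since the text only says the switch is flipped randomly.

```python
import random
from typing import List, Tuple


def select_decoder_input(
    p1: List[float], p2: List[float], rng: random.Random
) -> Tuple[str, List[float]]:
    """Randomly route either splicing result P1 or P2 to the shared decoder."""
    if rng.random() < 0.5:  # assumed: both branches are equally likely
        return "linguistic", p1
    return "ppg", p2


z0 = [0.0, 1.0]  # user encoding vector (toy values)
z1 = [0.3, 0.7]  # first-encoder output for the linguistic features
z2 = [0.9, 0.1]  # second-encoder output for the phoneme posterior features
p1 = z0 + z1     # first splicing result
p2 = z0 + z2     # second splicing result

rng = random.Random(0)
counts = {"linguistic": 0, "ppg": 0}
for _ in range(1000):
    branch, decoder_input = select_decoder_input(p1, p2, rng)
    counts[branch] += 1
```

  • over many training steps both branches are exercised, so the shared decoder is trained on inputs from both encoders.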
  • the speech synthesis model obtained through the first training stage can be considered as a basic speech synthesis model.
  • the basic speech synthesis model also needs to learn the acoustic characteristics of these target users. Based on this, the training of the second training phase is triggered.
  • the identification information and voice signal samples corresponding to multiple target users are obtained.
  • the voice signal samples of the multiple target users are only used to train the second encoder and the decoder of the speech synthesis model. It is worth noting that in the second training stage, no text content is required in the training samples.
  • the user B is any one of a plurality of target users, and it is assumed that the voice signal sample of the user B is the voice signal B1. In fact, one or more sentences spoken by user B can be obtained as a sample of user B's voice signal.
  • the second training stage of the speech synthesis model may include the following steps:
  • the function value of the loss function is determined according to the acoustic characteristic output by the decoder and the acoustic characteristic corresponding to the above-mentioned speech signal B1 as the supervision information, and the parameter adjustment of the second encoder and the decoder in the model is performed.
  • an acoustic model may be connected to the input of the second encoder.
  • the acquisition process of the acoustic features and phoneme posterior probability features corresponding to the speech signal B1 is as follows: the speech signal B1 is framed to obtain a multi-frame speech signal; the acoustic features corresponding to each of the multi-frame speech signals are extracted; The acoustic features corresponding to the frame speech signals are input into the acoustic model to predict the phoneme posterior probability feature corresponding to the speech signal B1 through the acoustic model, wherein the acoustic features corresponding to the multiple frames of speech signals are used as the supervision information.
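  • the framing and per-frame feature extraction steps can be sketched as follows, using short-term average energy (one of the acoustic features named earlier) as the per-frame feature. The frame length, hop length, sample values, and the choice to drop trailing samples that do not fill a frame are assumptions for illustration.

```python
from typing import List


def frame_signal(
    samples: List[float], frame_length: int, hop_length: int
) -> List[List[float]]:
    """Split a signal into overlapping frames; incomplete trailing frames are dropped."""
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += hop_length
    return frames


def short_term_average_energy(frame: List[float]) -> float:
    """Mean of squared sample values within one frame."""
    return sum(x * x for x in frame) / len(frame)


signal_b1 = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]  # toy stand-in for B1
frames = frame_signal(signal_b1, frame_length=4, hop_length=2)
energies = [short_term_average_energy(f) for f in frames]
```

  • in a full system, per-frame features such as these would be fed to the acoustic model to predict the phoneme posterior probability feature.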
  • the phoneme posterior probability feature is coded by the second encoder to obtain a third coding vector Z3 corresponding to the phoneme posterior probability feature.
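  • the process of determining the second coding vector corresponding to the identification information of user B can be implemented as follows: the attribute information of user B and the attribute information of user D are obtained; if the attribute information of user B matches the attribute information of user D, the second coding vector corresponding to the identification information of user B is determined to be the fourth coding vector Z0 corresponding to the identification information of the user D.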
  • a user matching the attribute information of user B is found from the multiple users used in the first training stage, and the coding vector corresponding to this user is used as the coding vector corresponding to user B.
  • the attribute information may include one or more of age, gender, occupation, and location area of attribution.
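  • the attribute-matching lookup can be sketched as follows. This is a sketch under assumptions: exact equality on a chosen set of attribute keys is used as the matching rule, the key `age_group` is a hypothetical bucketing of age, and the first matching user wins; the text does not define the matching criterion.

```python
from typing import Dict, Optional, Sequence


def find_matching_user(
    target_attributes: Dict[str, str],
    first_stage_users: Dict[str, Dict[str, str]],
    keys: Sequence[str] = ("gender", "age_group"),
) -> Optional[str]:
    """Return the first first-stage user whose attributes equal the target's on all keys."""
    for user_id, attributes in first_stage_users.items():
        if all(attributes.get(k) == target_attributes.get(k) for k in keys):
            return user_id
    return None


FIRST_STAGE_USERS = {
    "user_D": {"gender": "female", "age_group": "20-29"},
    "user_E": {"gender": "male", "age_group": "30-39"},
}
SPEAKER_VECTORS = {"user_D": [1.0, 0.0], "user_E": [0.0, 1.0]}

user_b = {"gender": "female", "age_group": "20-29"}
matched = find_matching_user(user_b, FIRST_STAGE_USERS)
# Reuse the matched user's encoding vector as the encoding vector for user B.
vector_for_b = SPEAKER_VECTORS[matched] if matched is not None else None
```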
  • the second code vector Z0 and the third code vector Z3 are spliced to obtain the splicing result P3.
  • the splicing result P3 is input to the decoder, and the decoder decodes and outputs the predicted acoustic features of user B.
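  • the difference between the two training stages, namely which components have their parameters adjusted, can be sketched with a toy update step. The scalar "parameters", gradient values, learning rate, and plain gradient-descent update are all assumptions for illustration; only the set of trained components per stage comes from the description.

```python
from typing import Dict, Set

# Components whose parameters are adjusted in each training stage.
TRAINABLE_BY_STAGE: Dict[int, Set[str]] = {
    1: {"first_encoder", "second_encoder", "decoder"},
    2: {"second_encoder", "decoder"},
}


def sgd_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    stage: int,
    learning_rate: float,
) -> Dict[str, float]:
    """Update only the components trained in the given stage; others stay frozen."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {
        name: (value - learning_rate * grads[name] if name in trainable else value)
        for name, value in params.items()
    }


params = {"first_encoder": 1.0, "second_encoder": 1.0, "decoder": 1.0}
grads = {"first_encoder": 0.5, "second_encoder": 0.5, "decoder": 0.5}
after_stage_two = sgd_step(params, grads, stage=2, learning_rate=0.1)
```

  • in this sketch the first encoder is left unchanged in the second stage, while the second encoder and the shared decoder are updated.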
  • the final speech synthesis model consisting of the front-end module, the first encoder, and the decoder can map the linguistic features of any text content to the acoustic features of multiple target users.
  • the speech synthesis model that is finally put into use is composed of the trained front-end module, first encoder, and decoder.
  • the speech synthesis device according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art can understand that all of these speech synthesis devices can be configured by using commercially available hardware components through the steps taught in this solution.
  • FIG. 8 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present invention. As shown in FIG. 8, the device includes: a first acquisition module 11, a determination module 12, a second acquisition module 13, and a generation module 14.
  • the first obtaining module 11 is configured to obtain text content and target user identification information corresponding to the interactive behavior in response to the interactive behavior triggered by the user.
  • the determining module 12 is used to determine the linguistic features corresponding to the text content.
  • the second acquisition module 13 is configured to input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain the acoustic characteristics of the target user corresponding to the text content through the speech synthesis model.
  • the generating module 14 is configured to generate a voice signal corresponding to the text content according to the acoustic feature to output the voice signal.
  • the speech synthesis model includes a first encoder and a decoder; the second acquisition module 13 may be specifically configured to: encode the linguistic features through the first encoder to obtain the first coding vector corresponding to the linguistic features; determine the second coding vector corresponding to the identification information of the target user; splice the first coding vector and the second coding vector; and decode the spliced coding vector by the decoder to obtain the acoustic feature.
  • the speech synthesis model further includes a second encoder, and the second encoder shares the decoder with the first encoder.
  • the device further includes: a first training module and a second training module.
  • the first training module is configured to: obtain a voice signal sample corresponding to the target user, where the voice signal sample does not correspond to the text content; determine the phoneme posterior probability feature and acoustic feature corresponding to the voice signal sample; and use the acoustic features corresponding to the voice signal sample as supervision information, and input the phoneme posterior probability feature corresponding to the voice signal sample and the identification information of the target user into the speech synthesis model to train the second encoder and the decoder.
  • the first training module is specifically configured to: encode the phoneme posterior probability feature by the second encoder to obtain a third coding vector corresponding to the phoneme posterior probability feature; splice the second coding vector corresponding to the identification information of the target user and the third coding vector; and decode the spliced coding vector by the decoder to obtain the acoustic characteristics output by the decoder.
  • the first training module is specifically configured to: obtain identification information and voice signal samples corresponding to multiple users, the multiple users include the target user, and the voice signal samples of the multiple users Used for training the second encoder and the decoder; acquiring the voice signal samples corresponding to the target user from the voice signal samples corresponding to the multiple users.
  • the first training module is specifically configured to: perform framing processing on the voice signal sample to obtain a multi-frame voice signal; extract the acoustic features corresponding to each of the multi-frame voice signal; and input the acoustic features corresponding to each of the multiple frames of speech signals into the acoustic model to predict the phoneme posterior probability features corresponding to the speech signal samples through the acoustic model, wherein the respective acoustic features of the multiple frames of speech signals are used as the supervision information.
  • the second training module is configured to: obtain multiple training sample pairs corresponding to multiple users, wherein any training sample pair corresponding to any user is composed of a voice signal and text content corresponding to the voice signal, and the target user is not included in the plurality of users; and train the speech synthesis model through the multiple training sample pairs corresponding to the plurality of users.
  • the second training module is specifically configured to: for any training sample pair corresponding to any user, obtain the acoustic features and phoneme posterior probability features corresponding to the speech signal in the training sample pair, and obtain the linguistic features corresponding to the text content in the training sample pair, wherein the acoustic features corresponding to the speech signal in the training sample pair are used as supervision information; determine a fourth coding vector corresponding to the identification information of the user; encode the linguistic features by the first encoder to obtain a fifth coding vector corresponding to the linguistic features; encode the phoneme posterior probability feature by the second encoder to obtain a sixth coding vector corresponding to the phoneme posterior probability feature; splice the fourth coding vector and the fifth coding vector to obtain a first splicing result, and splice the fourth coding vector and the sixth coding vector to obtain a second splicing result; and decode the first splicing result or the second splicing result by the decoder.
  • the second training module is specifically configured to: obtain the attribute information of the any user and the attribute information of the target user; if the attribute information of the target user matches the attribute information of the any user , It is determined that the second coding vector corresponding to the identification information of the target user is: a fourth coding vector corresponding to the identification information of any user.
  • the device shown in FIG. 8 can execute the speech synthesis method provided in the foregoing embodiments shown in FIG. 1 to FIG.
  • the structure of the speech synthesis apparatus shown in FIG. 8 may be implemented as an electronic device.
  • the electronic device may include a processor 21 and a memory 22.
  • executable code is stored on the memory 22, and when the executable code is executed by the processor 21, the processor 21 can at least implement the speech synthesis method provided in the embodiments shown in FIGS. 1 to 7 above.
  • the electronic device may also include a communication interface 23 for communicating with other devices.
  • an embodiment of the present invention provides a non-transitory machine-readable storage medium having executable code stored thereon; when the executable code is executed by a processor of an electronic device, the processor can at least implement the speech synthesis method provided in the foregoing embodiments shown in FIG. 1 to FIG. 7.
  • each implementation manner can be implemented with the aid of a necessary general-purpose hardware platform, and of course can also be implemented by a combination of hardware and software.
  • the part of the above technical solution that is essential or that contributes to the prior art can be embodied in the form of a computer product, and the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • the speech synthesis method provided by the embodiment of the present invention can be executed by a certain program/software, and the program/software can be provided by the network side.
  • the electronic device mentioned in the foregoing embodiment can download the program/software to a local non-volatile storage medium.
  • when the speech synthesis method needs to be executed, the program/software is read into the memory by the CPU and then executed by the CPU to realize the speech synthesis method provided in the foregoing embodiments.
  • for the execution process, refer to the schematic diagrams in FIG. 1 to FIG. 7 described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

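A speech synthesis method, apparatus, electronic device and storage medium. The method includes: in response to an interaction behavior triggered by a user, acquiring text content corresponding to the interaction behavior and identification information of a target user (101); determining linguistic features corresponding to the text content (102); inputting the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, acoustic features of the target user corresponding to the text content (103); and generating, according to the acoustic features, a voice signal of the target user corresponding to the text content, and outputting the voice signal (104). The method makes it possible to conduct personalized voice interaction with a user in the voice of a specific person.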

Description

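Speech synthesis method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technology, and in particular to a speech synthesis method, apparatus, device and storage medium.
Background
With the development of artificial intelligence technology, applications that support voice interaction keep emerging, such as question-answering robots and smart speakers.
Taking a question-answering robot as an example, in response to a user's spoken question the robot can output a spoken answer. At present, the answers output by such a robot usually share one uniform set of acoustic features, so interactivity is poor.
Summary
Embodiments of the present invention provide a speech synthesis method, apparatus, device and storage medium that enable personalized voice interaction.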
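In a first aspect, an embodiment of the present invention provides a speech synthesis method, the method including:
in response to an interaction behavior triggered by a user, acquiring text content corresponding to the interaction behavior and identification information of a target user;
determining linguistic features corresponding to the text content;
inputting the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, acoustic features of the target user corresponding to the text content; and
generating, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.
In a second aspect, an embodiment of the present invention provides a speech synthesis apparatus, the apparatus including:
a first acquisition module, configured to acquire, in response to an interaction behavior triggered by a user, text content corresponding to the interaction behavior and identification information of a target user;
a determination module, configured to determine linguistic features corresponding to the text content;
a second acquisition module, configured to input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, acoustic features of the target user corresponding to the text content; and
a generation module, configured to generate, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.
In a third aspect, an embodiment of the present invention provides an electronic device including a memory and a processor, wherein executable code is stored on the memory, and when the executable code is executed by the processor, the processor is caused to at least implement the speech synthesis method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is caused to at least implement the speech synthesis method of the first aspect.
In the embodiments of the present invention, when a voice signal corresponding to some text content is to be output to a user (for example user A) in the voice of a target user (for example user B), the linguistic features corresponding to the text content are determined first. The linguistic features and the identification information of the target user are then input into the speech synthesis model, which yields the acoustic features of the target user corresponding to the text content. The speech synthesis model has already learned the acoustic features of the target user. Finally, a vocoder generates the voice signal corresponding to the text content from the acoustic features output by the speech synthesis model. With this solution, personalized voice interaction with a user can be conducted in the voice of a specific person.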
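Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a speech synthesis process using a speech synthesis model provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a usage scenario of a speech synthesis method provided by an embodiment of the present invention;
FIG. 4 is a schematic flowchart of the first training stage of a speech synthesis model provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the training principle of the first training stage of a speech synthesis model provided by an embodiment of the present invention;
FIG. 6 is a schematic flowchart of the second training stage of a speech synthesis model provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of the training principle of the second training stage of a speech synthesis model provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an electronic device corresponding to the speech synthesis apparatus provided by the embodiment shown in FIG. 8.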
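Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
The terms used in the embodiments of the present invention are for describing particular embodiments only and are not intended to limit the present invention. The singular forms "a", "said" and "the" used in the embodiments and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise; "multiple" generally means at least two.
Depending on the context, the words "if" and "in case" as used herein may be interpreted as "when", "upon", "in response to determining" or "in response to detecting". Similarly, depending on the context, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
In addition, the order of steps in the following method embodiments is only an example and not a strict limitation.
The speech synthesis method provided by the embodiments of the present invention can be executed by an electronic device, which may be a terminal device such as a PC, a notebook computer, a smartphone or an intelligent robot, or may be a server. The server may be a physical server with an independent host, a virtual server, a cloud server or a server cluster.
The speech synthesis method provided by the embodiments of the present invention is applicable to any scenario in which a voice signal needs to be output to a user, such as a human-machine dialogue with an intelligent robot or voice interaction with a voice assistant. Accordingly, the electronic device may host one or more applications that support voice interaction for users to use.
The execution of the speech synthesis method provided herein is illustrated below with reference to the following embodiments.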
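FIG. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
101. In response to an interaction behavior triggered by a user, acquire text content corresponding to the interaction behavior and identification information of a target user.
102. Determine the linguistic features corresponding to the text content.
103. Input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, the acoustic features of the target user corresponding to the text content.
104. Generate, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.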
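In summary, the purpose of the speech synthesis method provided by the embodiments of the present invention is to output given text content in the voice of a specific user (the target user above).
In practice, the user-triggered interaction behavior in step 101 can be understood as, for example, the user inputting a voice command to an APP or smart device that supports voice interaction while using it.
Taking a human-machine dialogue scenario as an example, the text content may be text that a terminal device such as an intelligent robot determines, based on the interaction behavior triggered by user A, needs to be output to user A. If the target user is user B, the text content needs to be output to user A in user B's voice. For example, user A says "What will the weather be like in Beijing tomorrow?". Suppose that, after speech recognition and semantic understanding of this utterance, the text content of the response is determined to be: "Tomorrow Beijing will be sunny, with temperatures between -5°C and 3°C and a force 1 northeast wind." That text content is then output in user B's voice.
Taking an application that supports voice interaction as an example, suppose user A is an ordinary user of the application. Optionally, user A can choose the target user they want, such as user B, which personalizes the voice interaction for user A. Alternatively, the application can be configured with a default target user, such as user C, so that it interacts with all of its users in user C's voice.
In practice, the application can display a list of target users in its interface, from which user A selects the desired one. The speech synthesis model provided by the embodiments of the present invention has already learned the acoustic features of each target user in the list; more specifically, it has learned the mapping between each target user's acoustic features and the linguistic features of arbitrary text content. The implementation is described in later embodiments.
The following description takes the case where user A selects user B as the target user. The identification information of the target user may then be user B's name, number or other identifier.
After the text content to be output to user A is determined, the linguistic features corresponding to the text content are determined first. Optionally, the speech synthesis model may include a front-end module that annotates the linguistic features of the text content; the annotation process can follow existing techniques. In practice, the linguistic features that can be annotated include, but are not limited to: the pronunciation and tone of each character; the position and part of speech of each word in the text content; and the prosody, stress and rhythm of the text content.
Next, the linguistic features corresponding to the text content and user B's identification information are input into the speech synthesis model to obtain the acoustic features of user B corresponding to the text content. Specifically, as shown in FIG. 2, the speech synthesis model includes a first encoder and a decoder, and obtaining the acoustic features proceeds as follows: the front-end module annotates the linguistic features of the text content; the first encoder encodes the linguistic features to obtain a first encoding vector C1 corresponding to the linguistic features; a second encoding vector C2 corresponding to user B's identification information is determined; the first encoding vector C1 and the second encoding vector C2 are spliced to obtain an encoding vector C3; and finally the decoder decodes the spliced encoding vector C3 to obtain the acoustic features of user B corresponding to the text content.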
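The encode, splice and decode path of FIG. 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in the application: `toy_encoder`, `toy_decoder` and the `speaker_table` lookup are hypothetical stand-ins for the trained first encoder, the trained decoder and the per-user encoding vectors, and splicing is modeled as plain list concatenation.

```python
from typing import Callable, Dict, List

Vector = List[float]


def synthesize_acoustic_features(
    linguistic_features: Vector,
    speaker_id: str,
    first_encoder: Callable[[Vector], Vector],
    speaker_table: Dict[str, Vector],
    decoder: Callable[[Vector], Vector],
) -> Vector:
    """Encode the text features, splice in the speaker vector, then decode."""
    c1 = first_encoder(linguistic_features)  # first encoding vector C1
    c2 = speaker_table[speaker_id]           # second encoding vector C2
    c3 = c1 + c2                             # splicing = list concatenation -> C3
    return decoder(c3)


def toy_encoder(features: Vector) -> Vector:
    # Hypothetical stand-in for the trained first encoder.
    return [2.0 * value for value in features]


def toy_decoder(spliced: Vector) -> Vector:
    # Hypothetical stand-in for the trained decoder.
    return [sum(spliced)]


speaker_table = {"user_B": [0.0, 1.0]}
acoustic = synthesize_acoustic_features(
    [0.5, 1.5], "user_B", toy_encoder, speaker_table, toy_decoder
)
```

With these toy stand-ins, C1 is `[1.0, 3.0]`, C3 is `[1.0, 3.0, 0.0, 1.0]`, and the decoder returns `[5.0]`. In a real system the decoder output would be a sequence of acoustic feature frames that is passed on to a vocoder.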
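Acoustic features are features that reflect acoustic characteristics of a person such as speaking rate and timbre. Optionally, the acoustic features may be Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), short-time average energy, average rate of amplitude change, and so on.
Optionally, the first encoder and the decoder may be implemented as neural network models such as a Recurrent Neural Network (RNN) model or a Long Short-Term Memory (LSTM) model.
Finally, after the speech synthesis model yields the acoustic features of user B corresponding to the text content, a vocoder can generate from those acoustic features the voice signal of user B corresponding to the text content, that is, a voice signal in which the text content is spoken with user B's acoustic features. This completes the task of speaking the text content to user A in user B's voice.
It is worth noting the following case. Take a human-machine dialogue in which user A uses some APP, and suppose user A has set user B as the target user. Suppose further that the APP already stores voice signals of various text content spoken in user B's voice, and that this text content is exactly what the APP can reply with during a dialogue (that is, it can be regarded as a set of reply templates). In other words, the APP may store in advance voice signals of several specific users each speaking several pieces of text content. Under this assumption, when user A speaks a query and the APP determines the target text content for the response, the APP can, based on user A's selection of target user B, look up the voice signal of user B speaking that target text content and output it. However, when none of those specific users is the target user that user A wants, the speech synthesis solution provided by the embodiment shown in FIG. 1 can be executed.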
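The lookup-then-fallback behavior described here can be sketched as follows. The function and variable names (`get_reply_audio`, `prerecorded`, `fake_synthesize`) are hypothetical and chosen for illustration; the application does not specify how stored voice signals are keyed, so a `(speaker, text)` tuple is assumed.

```python
from typing import Callable, Dict, Tuple


def get_reply_audio(
    speaker_id: str,
    text: str,
    prerecorded: Dict[Tuple[str, str], bytes],
    synthesize: Callable[[str, str], bytes],
) -> bytes:
    """Return a stored voice signal when one exists, otherwise synthesize it."""
    key = (speaker_id, text)
    if key in prerecorded:
        return prerecorded[key]
    return synthesize(speaker_id, text)


def fake_synthesize(speaker_id: str, text: str) -> bytes:
    # Hypothetical stand-in for the FIG. 1 pipeline followed by a vocoder.
    return f"synth:{speaker_id}:{text}".encode("utf-8")


prerecorded = {("user_B", "Sunny, twenty degrees."): b"stored-audio"}
hit = get_reply_audio("user_B", "Sunny, twenty degrees.", prerecorded, fake_synthesize)
miss = get_reply_audio("user_E", "Sunny, twenty degrees.", prerecorded, fake_synthesize)
```

Here `hit` comes from the stored templates, while `miss` falls through to synthesis because no recording exists for `user_E`.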
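In the solution provided by the embodiments of the present invention, the core of speech synthesis is training a speech synthesis model that learns the acoustic features of different users under different linguistic features, with low training cost and high accuracy. Based on this model, the task of outputting a voice signal in a specific user's voice can be completed efficiently.
For ease of understanding, the execution of the speech synthesis method in practice is illustrated below with reference to FIG. 3.
FIG. 3 is a schematic diagram of a usage scenario of a speech synthesis method provided by an embodiment of the present invention. In FIG. 3, suppose an application (APP) that supports voice interaction, such as a common voice assistant, is installed on user A's mobile phone. Suppose user A has already configured the APP to imitate user B when interacting by voice with user A. Suppose also that voice signal samples of user B have been collected so that the speech synthesis model has learned the acoustic features of user B corresponding to various linguistic features.
On this basis, suppose user A says to the APP "What will the weather be like in Beijing tomorrow?", and the content the APP needs to reply with is: "Sunny, twenty degrees." Since user A has configured the APP to imitate user B, as shown in FIG. 3 the APP first inputs the reply content into the front-end module of the speech synthesis model to obtain the linguistic features T corresponding to the reply content. The linguistic features T are input into the first encoder of the speech synthesis model to obtain encoding vector Ca, the encoding vector Cb corresponding to user B's identification information is determined, and Ca and Cb are spliced to obtain encoding vector Cc. Cc is input into the decoder to obtain the acoustic features S of user B corresponding to the reply content, and S is input into the vocoder, which outputs the voice signal W. The waveform of W is shown in FIG. 3.
Of course, in practice the speech synthesis solution provided herein is applicable not only to the scenario of FIG. 3 but also to other scenarios involving voice interaction with users, such as video dubbing and live streaming.
In a video dubbing scenario, take a video clip that contains character Z, whose lines were originally dubbed in user X's voice and are now to be dubbed in user Y's voice. Character Z's lines correspond to the text content in the foregoing embodiments, and the target user is user Y. Through the training process of the speech synthesis model described herein, the model can learn the acoustic features of user Y under various linguistic features. The model can therefore predict the acoustic features of user Y under the linguistic features of the text content (the lines), and from the predicted acoustic features a voice signal of user Y speaking the same lines can be synthesized, so that character Z is dubbed in user Y's voice.
In a live streaming scenario, the speech synthesis solution provided by the embodiments of the present invention lets one streamer broadcast in several different voices. For example, suppose a streamer recommends several products to viewers in a live room and wants to recommend different products in different voices. The streamer can configure a correspondence between target users and products, that is, which target user's voice is used to recommend which product. Suppose the configuration is: recommend product S in user C's voice, product T in user D's voice, and product R in the streamer's own voice. During the actual broadcast, the audio/video capture device on the streamer's side captures the audio and video data of the streamer presenting these three products and uploads it to the server. Based on the streamer's configuration, the server can cut out the audio/video segment corresponding to each product from the uploaded data. The segment in which the streamer recommends product R is provided to viewers unchanged. The audio segments in which the streamer recommends products S and T first undergo automatic speech recognition (ASR) to obtain the corresponding text content; the speech synthesis method of the foregoing embodiments then synthesizes the text content for product S into a voice signal recommending product S in user C's voice, and the text content for product T into a voice signal recommending product T in user D's voice. The synthesis process is as described in the foregoing embodiments and is not repeated here.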
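The training process of the speech synthesis model mentioned above is described below.
It should be noted that, although only the front-end module, the first encoder and the decoder of the speech synthesis model are ultimately used to synthesize speech for given text content, the speech synthesis model also includes a second encoder during training so that the first encoder and the decoder can be trained to convergence. The second encoder shares the decoder with the first encoder.
Training the speech synthesis model that contains the first encoder and the second encoder comprises two stages, called the first training stage and the second training stage.
In the first training stage, multiple training sample pairs corresponding to multiple users are acquired. Any training sample pair of any user consists of a voice signal and the text content corresponding to that voice signal, and the multiple users do not include the target user mentioned in the foregoing embodiments. The speech synthesis model is trained with these training sample pairs to complete the first training stage.
Take user D, any one of the multiple users, as an example. Suppose a training sample pair of user D includes voice signal D1 and text content D2, where voice signal D1 is the speech of user D reading out text content D2.
In practice, a large set of text content can be prepared in advance, and different users read out all or part of it. Each user is recorded while reading, which yields the voice signals of the training sample pairs.
Taking voice signal D1 and text content D2 of user D as an example, as shown in FIG. 4, the first training stage of the speech synthesis model may include the following steps:
401. Acquire the acoustic features and the phoneme posterior probability features corresponding to voice signal D1, and acquire the linguistic features corresponding to text content D2; the acoustic features corresponding to voice signal D1 serve as the supervision information.
402. Determine a fourth encoding vector Z0 corresponding to the identification information of user D.
403. Encode the linguistic features with the first encoder to obtain a fifth encoding vector Z1 corresponding to the linguistic features; encode the phoneme posterior probability features with the second encoder to obtain a sixth encoding vector Z2 corresponding to the phoneme posterior probability features.
404. Splice the fourth encoding vector Z0 and the fifth encoding vector Z1 to obtain a first splicing result P1, and splice the fourth encoding vector Z0 and the sixth encoding vector Z2 to obtain a second splicing result P2.
405. Decode the first splicing result P1 or the second splicing result P2 with the decoder to obtain the acoustic features output by the decoder.
Finally, the value of the loss function is determined from the acoustic features output by the decoder and the acoustic features corresponding to voice signal D1 that serve as the supervision information, and the parameters of the first encoder, the second encoder and the decoder are adjusted accordingly.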
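The application does not name the loss function. Purely as an assumption for illustration, the sketch below uses mean squared error between the acoustic feature frames output by the decoder and the supervising acoustic feature frames extracted from voice signal D1; the function name `mean_squared_error` and the frame layout (a list of equal-length frames) are hypothetical.

```python
from typing import List

Frames = List[List[float]]


def mean_squared_error(predicted: Frames, target: Frames) -> float:
    """Average squared difference over every value of every frame."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of frames")
    total = 0.0
    count = 0
    for predicted_frame, target_frame in zip(predicted, target):
        if len(predicted_frame) != len(target_frame):
            raise ValueError("frames must have the same dimensionality")
        for p, t in zip(predicted_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count


decoder_output = [[1.0, 2.0], [3.0, 4.0]]  # acoustic features from the decoder
supervision = [[1.0, 2.0], [3.0, 2.0]]     # acoustic features extracted from D1
loss = mean_squared_error(decoder_output, supervision)
```

Only one of the four values differs (4.0 against 2.0), so the loss is 4 / 4 = 1.0. A training framework would backpropagate such a value to adjust the two encoders and the decoder.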
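Phoneme posterior probability features, known as phonetic posteriorgrams (PPGs), form a time-versus-class matrix that gives the posterior probability of each pronunciation class y at each time frame t of an audio segment. In other words, they represent, for each of the frames contained in a voice signal, the probability distribution over pronunciation classes. Here a pronunciation class is a phoneme, the smallest unit of pronunciation.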
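To make the shape of a PPG concrete, the following sketch turns per-frame phoneme scores into a frames-by-phonemes matrix of probabilities. It assumes, for illustration only, that an acoustic model emits one unnormalized score (logit) per phoneme per frame and that a softmax normalizes them; the application does not describe the acoustic model's output layer, and the three-phoneme inventory is hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Normalize scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def to_ppg(frame_logits: List[List[float]]) -> List[List[float]]:
    """Row t of the result is the phoneme distribution for time frame t."""
    return [softmax(row) for row in frame_logits]


# Two frames, three hypothetical phoneme classes.
ppg = to_ppg([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
most_likely = [row.index(max(row)) for row in ppg]
```

Each row of `ppg` sums to 1, and `most_likely` gives the index of the dominant phoneme in each frame. Because a PPG describes what is being pronounced rather than who is speaking, it can stand in for text-derived features when no transcript is available.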
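For ease of understanding, the composition of the speech synthesis model and the first training stage are illustrated below with reference to FIG. 5. As shown in FIG. 5, the input of the first encoder can be connected to the front-end module described above, and the input of the second encoder can be connected to an acoustic model.
Specifically, text content D2 is input into the front-end module, which outputs the linguistic features corresponding to text content D2.
The phoneme posterior probability features corresponding to voice signal D1 can be obtained as follows:
Perform framing on voice signal D1 to obtain multiple frames; extract the acoustic features of each frame; and input the acoustic features of the frames into the acoustic model, which predicts the phoneme posterior probability features corresponding to voice signal D1. The acoustic features of the frames also serve as the supervision information.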
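The framing step can be sketched as a sliding window over the samples. The frame length and hop length below are arbitrary illustrative values, and dropping a trailing partial frame is an assumption; the application says only that the voice signal is split into frames.

```python
from typing import List, Sequence


def split_into_frames(
    samples: Sequence[float], frame_length: int, hop_length: int
) -> List[List[float]]:
    """Slice a signal into overlapping frames; a trailing partial frame is dropped."""
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(list(samples[start:start + frame_length]))
        start += hop_length
    return frames


signal = [float(i) for i in range(10)]  # toy stand-in for voice signal D1
frames = split_into_frames(signal, frame_length=4, hop_length=2)
```

Ten samples with a window of 4 and a hop of 2 give four frames starting at samples 0, 2, 4 and 6. Acoustic features such as MFCCs would then be computed per frame.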
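Next, the first encoder encodes the linguistic features to obtain the fifth encoding vector Z1 corresponding to the linguistic features. The second encoder encodes the phoneme posterior probability features to obtain the sixth encoding vector Z2 corresponding to the phoneme posterior probability features.
In practice, suppose the number of users collected for the first training stage is N, with N greater than 1. Optionally, an N-dimensional vector can be generated for each user as the encoding vector corresponding to that user's identification information. Suppose the encoding vector of user D is Z0.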
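One simple way to realize an N-dimensional vector per user is one-hot encoding. This is an assumption made for illustration: the application states only that an N-dimensional vector may be generated for each user, and the user identifiers below are hypothetical.

```python
from typing import Dict, List, Sequence


def build_speaker_vectors(user_ids: Sequence[str]) -> Dict[str, List[float]]:
    """Assign each of the N users an N-dimensional one-hot vector."""
    n = len(user_ids)
    return {
        user_id: [1.0 if column == row else 0.0 for column in range(n)]
        for row, user_id in enumerate(user_ids)
    }


vectors = build_speaker_vectors(["user_C", "user_D", "user_E"])
z0 = vectors["user_D"]  # encoding vector Z0 for user D
```

With three users, `z0` is `[0.0, 1.0, 0.0]`. A learned embedding table would serve the same purpose with a smaller dimensionality.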
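The fourth encoding vector Z0 and the fifth encoding vector Z1 are spliced to obtain the first splicing result P1, and the fourth encoding vector Z0 and the sixth encoding vector Z2 are spliced to obtain the second splicing result P2.
As shown in FIG. 5, a switch can be thought of as sitting at the input of the decoder. Flipping the switch at random controls whether the first splicing result P1 or the second splicing result P2 is fed to the decoder. The decoder's output for P1 may differ from its output for P2.
In practice, taking voice signal D1 and text content D2 of user D as an example, this training sample pair is used as input repeatedly during the first training stage. Over many repetitions, both the first splicing result P1 and the second splicing result P2 of this pair will most likely be fed to the decoder. Thus, after the first training stage over a large number of training sample pairs, the decoder learns both the mapping from phoneme posterior probability features to acoustic features and the mapping from linguistic features to acoustic features.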
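The random switch at the decoder input can be sketched as a coin flip per training step. The 50/50 probability and the branch labels are assumptions for illustration; the application says only that the switch is flipped at random.

```python
import random
from typing import List, Tuple


def pick_decoder_input(
    p1: List[float], p2: List[float], rng: random.Random
) -> Tuple[str, List[float]]:
    """Randomly route either the text branch (P1) or the PPG branch (P2) to the decoder."""
    if rng.random() < 0.5:
        return "text_branch", p1
    return "ppg_branch", p2


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {"text_branch": 0, "ppg_branch": 0}
for step in range(1000):
    branch, decoder_input = pick_decoder_input([1.0], [2.0], rng)
    counts[branch] += 1
```

Over many steps both branches are exercised, which is what lets the shared decoder learn a mapping from each kind of input to acoustic features.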
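The speech synthesis model obtained from the first training stage can be regarded as a base speech synthesis model. When the model needs to synthesize voice signals for a small number of target users (users different from those used in the first training stage), the base model still has to learn the acoustic features of these target users. This triggers the second training stage.
In the second training stage, when there are multiple target users, the identification information and voice signal samples of the multiple target users are acquired. These voice signal samples are used only to train the second encoder and the decoder of the speech synthesis model. Notably, the training samples of the second training stage need no text content.
Continuing with target user B from the foregoing embodiments, suppose user B is any one of the multiple target users and user B's voice signal sample is voice signal B1. In practice, one or more utterances spoken freely by user B can be collected as user B's voice signal samples.
Taking voice signal B1 of user B as an example, as shown in FIG. 6, the second training stage of the speech synthesis model may include the following steps:
601. Acquire the voice signal sample corresponding to user B.
602. Determine the phoneme posterior probability features and the acoustic features corresponding to the voice signal sample.
603. With the acoustic features of the voice signal sample as the supervision information, determine a second encoding vector corresponding to the identification information of user B, and encode the phoneme posterior probability features with the second encoder to obtain a third encoding vector corresponding to the phoneme posterior probability features.
604. Splice the second encoding vector and the third encoding vector.
605. Decode the spliced encoding vector with the decoder to obtain the acoustic features output by the decoder.
Finally, the value of the loss function is determined from the acoustic features output by the decoder and the acoustic features corresponding to voice signal B1 that serve as the supervision information, and the parameters of the second encoder and the decoder are adjusted accordingly.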
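The key difference from the first stage is which parameters move: only the second encoder and the decoder are adjusted, while the first encoder stays fixed. The sketch below illustrates that with a plain gradient-descent step over named parameter groups. The optimizer, the learning rate and the scalar "weights" are all assumptions for illustration; the application does not specify them.

```python
from typing import Dict, List, Set


def sgd_step(
    params: Dict[str, List[float]],
    grads: Dict[str, List[float]],
    trainable: Set[str],
    lr: float,
) -> Dict[str, List[float]]:
    """Apply w <- w - lr * g to trainable groups only; copy the rest unchanged."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)
    return updated


params = {"first_encoder": [1.0], "second_encoder": [1.0], "decoder": [2.0]}
grads = {"first_encoder": [1.0], "second_encoder": [1.0], "decoder": [2.0]}
stage_two = sgd_step(params, grads, trainable={"second_encoder", "decoder"}, lr=0.5)
```

After the step, the first encoder is still `[1.0]`, while the second encoder and the decoder have moved to `[0.5]` and `[1.0]`. Because the decoder is shared, what it learns about user B through the PPG branch carries over to the text branch at synthesis time.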
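For ease of understanding, the second training stage of the speech synthesis model is illustrated below with reference to FIG. 7. As shown in FIG. 7, the input of the second encoder can be connected to the acoustic model.
Specifically, the acoustic features and the phoneme posterior probability features corresponding to voice signal B1 are obtained as follows: perform framing on voice signal B1 to obtain multiple frames; extract the acoustic features of each frame; and input the acoustic features of the frames into the acoustic model, which predicts the phoneme posterior probability features corresponding to voice signal B1. The acoustic features of the frames serve as the supervision information.
Next, the second encoder encodes the phoneme posterior probability features to obtain a third encoding vector Z3 corresponding to the phoneme posterior probability features.
In the second training stage, the second encoding vector corresponding to user B's identification information can be determined as follows:
Acquire the attribute information of the multiple users used in the first training stage and the attribute information of user B. If user B's attribute information matches that of one of those users (say user D), the second encoding vector corresponding to user B's identification information is determined to be the fourth encoding vector Z0 corresponding to user D's identification information.
Put simply, among the users of the first training stage, find one whose attribute information matches that of user B, and use that user's encoding vector as user B's encoding vector.
The attribute information may include one or more of age, gender, occupation and home region.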
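The attribute-matching rule can be sketched as a search over the first-stage users. What counts as a "match" is not defined in the application; exact equality on a chosen set of attribute keys is assumed here, and the attribute names and values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def pick_speaker_vector(
    target_attrs: Dict[str, str],
    known_attrs: Dict[str, Dict[str, str]],
    speaker_vectors: Dict[str, List[float]],
    keys: Sequence[str] = ("gender", "age_group"),
) -> Optional[List[float]]:
    """Reuse the vector of the first stage-one user whose attributes all match."""
    for user_id, attrs in known_attrs.items():
        if all(attrs.get(key) == target_attrs.get(key) for key in keys):
            return speaker_vectors[user_id]
    return None  # no stage-one user matches the target user


known_attrs = {
    "user_C": {"gender": "male", "age_group": "30s"},
    "user_D": {"gender": "female", "age_group": "20s"},
}
speaker_vectors = {"user_C": [1.0, 0.0], "user_D": [0.0, 1.0]}
vector_for_b = pick_speaker_vector(
    {"gender": "female", "age_group": "20s"}, known_attrs, speaker_vectors
)
```

User B's attributes match user D's, so `vector_for_b` is user D's vector `[0.0, 1.0]`. A softer similarity measure could be substituted when no exact match exists.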
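Taking Z0 as the second encoding vector corresponding to user B's identification information, the second encoding vector Z0 and the third encoding vector Z3 are spliced to obtain splicing result P3. P3 is input into the decoder, which outputs the predicted acoustic features of user B.
Compared with the embodiment shown in FIG. 5, the switch in FIG. 7 can be regarded as always connected to the branch of the second encoder.
After these two training stages, the speech synthesis model finally composed of the front-end module, the first encoder and the decoder can map the linguistic features of arbitrary text content to the acoustic features of multiple target users. In other words, in an application that actually needs speech synthesis, what is used is the speech synthesis model composed of the trained first encoder, the trained decoder and the front-end module.
Finally, it is worth noting that, to improve the accuracy of the speech synthesis model, a self-attention mechanism (Attention) can optionally be used between the decoder and the first encoder and between the decoder and the second encoder. Training then also yields attention parameters corresponding to the first encoder and attention parameters corresponding to the second encoder.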
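The speech synthesis apparatus of one or more embodiments of the present invention is described in detail below. Those skilled in the art will understand that such apparatuses can be built from commercially available hardware components configured according to the steps taught in this solution.
FIG. 8 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present invention. As shown in FIG. 8, the apparatus includes: a first acquisition module 11, a determination module 12, a second acquisition module 13 and a generation module 14.
The first acquisition module 11 is configured to acquire, in response to an interaction behavior triggered by a user, text content corresponding to the interaction behavior and identification information of a target user.
The determination module 12 is configured to determine the linguistic features corresponding to the text content.
The second acquisition module 13 is configured to input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, the acoustic features of the target user corresponding to the text content.
The generation module 14 is configured to generate, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.
Optionally, the speech synthesis model includes a first encoder and a decoder, and the second acquisition module 13 may be specifically configured to: encode the linguistic features with the first encoder to obtain a first encoding vector corresponding to the linguistic features; determine a second encoding vector corresponding to the identification information of the target user; splice the first encoding vector and the second encoding vector; and decode the spliced encoding vector with the decoder to obtain the acoustic features.
Optionally, the speech synthesis model further includes a second encoder, which shares the decoder with the first encoder.
On this basis, the apparatus further includes a first training module and a second training module.
The first training module is configured to: acquire a voice signal sample corresponding to the target user, where the voice signal sample does not correspond to the text content; determine the phoneme posterior probability features and the acoustic features corresponding to the voice signal sample; and, with the acoustic features corresponding to the voice signal sample as supervision information, input the phoneme posterior probability features corresponding to the voice signal sample and the identification information of the target user into the speech synthesis model to train the second encoder and the decoder.
Optionally, the first training module is specifically configured to: encode the phoneme posterior probability features with the second encoder to obtain a third encoding vector corresponding to the phoneme posterior probability features; splice the second encoding vector corresponding to the identification information of the target user and the third encoding vector; and decode the spliced encoding vector with the decoder to obtain the acoustic features output by the decoder.
Optionally, the first training module is specifically configured to: acquire identification information and voice signal samples corresponding to multiple users, where the multiple users include the target user and the voice signal samples of the multiple users are used to train the second encoder and the decoder; and acquire the voice signal sample corresponding to the target user from the voice signal samples corresponding to the multiple users.
Optionally, the first training module is specifically configured to: perform framing on the voice signal sample to obtain multiple frames; extract the acoustic features of each frame; and input the acoustic features of the frames into an acoustic model, so as to predict, through the acoustic model, the phoneme posterior probability features corresponding to the voice signal sample, where the acoustic features of the frames serve as the supervision information.
Optionally, the second training module is configured to: acquire multiple training sample pairs corresponding to multiple users, where any training sample pair of any user consists of a voice signal and the text content corresponding to the voice signal, and the multiple users do not include the target user; and train the speech synthesis model with the multiple training sample pairs corresponding to the multiple users.
Optionally, the second training module is specifically configured to: for any training sample pair of any user, acquire the acoustic features and the phoneme posterior probability features corresponding to the voice signal in the training sample pair, and acquire the linguistic features corresponding to the text content in the training sample pair, where the acoustic features corresponding to the voice signal in the training sample pair serve as supervision information; determine a fourth encoding vector corresponding to the identification information of the user; encode the linguistic features with the first encoder to obtain a fifth encoding vector corresponding to the linguistic features; encode the phoneme posterior probability features with the second encoder to obtain a sixth encoding vector corresponding to the phoneme posterior probability features; splice the fourth encoding vector and the fifth encoding vector to obtain a first splicing result, and splice the fourth encoding vector and the sixth encoding vector to obtain a second splicing result; and decode the first splicing result or the second splicing result with the decoder to obtain the acoustic features output by the decoder.
Optionally, the second training module is specifically configured to: acquire the attribute information of the any user and the attribute information of the target user; and, if the attribute information of the target user matches the attribute information of the any user, determine that the second encoding vector corresponding to the identification information of the target user is the fourth encoding vector corresponding to the identification information of the any user.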
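The apparatus shown in FIG. 8 can execute the speech synthesis method provided in the embodiments shown in FIG. 1 to FIG. 7. For the detailed execution process and technical effects, refer to the description of the foregoing embodiments, which is not repeated here.
In one possible design, the structure of the speech synthesis apparatus shown in FIG. 8 can be implemented as an electronic device. As shown in FIG. 9, the electronic device may include a processor 21 and a memory 22. Executable code is stored on the memory 22, and when the executable code is executed by the processor 21, the processor 21 is caused to at least implement the speech synthesis method provided in the embodiments shown in FIG. 1 to FIG. 7.
Optionally, the electronic device may further include a communication interface 23 for communicating with other devices.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium on which executable code is stored, and when the executable code is executed by a processor of an electronic device, the processor is caused to at least implement the speech synthesis method provided in the embodiments shown in FIG. 1 to FIG. 7.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of an embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above implementations, those skilled in the art can clearly understand that each implementation can be realized with the aid of a necessary general-purpose hardware platform, and of course also by a combination of hardware and software. Based on this understanding, the part of the above technical solution that is essential or that contributes to the prior art can be embodied in the form of a computer product, and the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The speech synthesis method provided by the embodiments of the present invention can be executed by a program/software, which can be provided from the network side. The electronic device mentioned in the foregoing embodiments can download the program/software to a local non-volatile storage medium and, when it needs to execute the speech synthesis method, read the program/software into memory through the CPU and have the CPU execute it to implement the speech synthesis method provided in the foregoing embodiments. For the execution process, refer to FIG. 1 to FIG. 7.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.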

Claims (12)

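  1. A speech synthesis method, characterized by comprising:
    in response to an interaction behavior triggered by a user, acquiring text content corresponding to the interaction behavior and identification information of a target user;
    determining linguistic features corresponding to the text content;
    inputting the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, acoustic features of the target user corresponding to the text content; and
    generating, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.
  2. The method according to claim 1, characterized in that the speech synthesis model includes a first encoder and a decoder;
    the obtaining, through the speech synthesis model, acoustic features of the target user corresponding to the text content comprises:
    encoding the linguistic features with the first encoder to obtain a first encoding vector corresponding to the linguistic features;
    determining a second encoding vector corresponding to the identification information of the target user;
    splicing the first encoding vector and the second encoding vector; and
    decoding the spliced encoding vector with the decoder to obtain the acoustic features.
  3. The method according to claim 2, characterized in that the speech synthesis model further includes a second encoder, and the second encoder shares the decoder with the first encoder;
    the method further comprises:
    acquiring a voice signal sample corresponding to the target user, wherein the voice signal sample does not correspond to the text content;
    determining phoneme posterior probability features and acoustic features corresponding to the voice signal sample; and
    with the acoustic features corresponding to the voice signal sample as supervision information, inputting the phoneme posterior probability features corresponding to the voice signal sample and the identification information of the target user into the speech synthesis model to train the second encoder and the decoder.
  4. The method according to claim 3, characterized in that the inputting the phoneme posterior probability features corresponding to the voice signal sample and the identification information of the target user into the speech synthesis model to train the second encoder and the decoder comprises:
    encoding the phoneme posterior probability features with the second encoder to obtain a third encoding vector corresponding to the phoneme posterior probability features;
    splicing the second encoding vector corresponding to the identification information of the target user and the third encoding vector; and
    decoding the spliced encoding vector with the decoder to obtain the acoustic features output by the decoder.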
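  5. The method according to claim 3, characterized in that the acquiring a voice signal sample corresponding to the target user comprises:
    acquiring identification information and voice signal samples corresponding to multiple users, wherein the multiple users include the target user and the voice signal samples of the multiple users are used to train the second encoder and the decoder; and
    acquiring the voice signal sample corresponding to the target user from the voice signal samples corresponding to the multiple users.
  6. The method according to claim 3, characterized in that the determining phoneme posterior probability features and acoustic features corresponding to the voice signal sample comprises:
    performing framing on the voice signal sample to obtain multiple frames of the voice signal;
    extracting the acoustic features corresponding to each of the multiple frames; and
    inputting the acoustic features corresponding to each of the multiple frames into an acoustic model, so as to predict, through the acoustic model, the phoneme posterior probability features corresponding to the voice signal sample, wherein the acoustic features corresponding to each of the multiple frames serve as the supervision information.
  7. The method according to claim 3, characterized in that the method further comprises:
    acquiring multiple training sample pairs corresponding to multiple users, wherein any training sample pair of any user consists of a voice signal and the text content corresponding to the voice signal, and the multiple users do not include the target user; and
    training the speech synthesis model with the multiple training sample pairs corresponding to the multiple users.
  8. The method according to claim 7, characterized in that the training the speech synthesis model with the multiple training sample pairs corresponding to the multiple users comprises:
    for any training sample pair of any user, acquiring the acoustic features and the phoneme posterior probability features corresponding to the voice signal in the training sample pair, and acquiring the linguistic features corresponding to the text content in the training sample pair, wherein the acoustic features corresponding to the voice signal in the training sample pair serve as supervision information;
    determining a fourth encoding vector corresponding to the identification information of the any user;
    encoding the linguistic features with the first encoder to obtain a fifth encoding vector corresponding to the linguistic features; encoding the phoneme posterior probability features with the second encoder to obtain a sixth encoding vector corresponding to the phoneme posterior probability features;
    splicing the fourth encoding vector and the fifth encoding vector to obtain a first splicing result, and splicing the fourth encoding vector and the sixth encoding vector to obtain a second splicing result; and
    decoding the first splicing result or the second splicing result with the decoder to obtain the acoustic features output by the decoder.
  9. The method according to claim 8, characterized in that the method further comprises:
    acquiring the attribute information of the any user and the attribute information of the target user; and
    if the attribute information of the target user matches the attribute information of the any user, determining that the second encoding vector corresponding to the identification information of the target user is the fourth encoding vector corresponding to the identification information of the any user.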
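  10. A speech synthesis apparatus, characterized by comprising:
    a first acquisition module, configured to acquire, in response to an interaction behavior triggered by a user, text content corresponding to the interaction behavior and identification information of a target user;
    a determination module, configured to determine linguistic features corresponding to the text content;
    a second acquisition module, configured to input the linguistic features and the identification information of the target user into a speech synthesis model, so as to obtain, through the speech synthesis model, acoustic features of the target user corresponding to the text content; and
    a generation module, configured to generate, according to the acoustic features, a voice signal corresponding to the text content, so as to output the voice signal.
  11. An electronic device, characterized by comprising a memory and a processor, wherein executable code is stored on the memory, and when the executable code is executed by the processor, the processor is caused to execute the speech synthesis method according to any one of claims 1 to 9.
  12. A non-transitory machine-readable storage medium, characterized in that executable code is stored on the non-transitory machine-readable storage medium, and when the executable code is executed by a processor of an electronic device, the processor is caused to execute the speech synthesis method according to any one of claims 1 to 9.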
PCT/CN2021/076683 2020-02-25 2021-02-18 语音合成方法、装置、设备和存储介质 WO2021169825A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010117047.1A CN113314096A (zh) 2020-02-25 2020-02-25 语音合成方法、装置、设备和存储介质
CN202010117047.1 2020-02-25

Publications (1)

Publication Number Publication Date
WO2021169825A1 true WO2021169825A1 (zh) 2021-09-02

Family

ID=77369952

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/076683 WO2021169825A1 (zh) 2020-02-25 2021-02-18 语音合成方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN113314096A (zh)
WO (1) WO2021169825A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294963A (zh) * 2022-04-12 2022-11-04 阿里巴巴达摩院(杭州)科技有限公司 语音合成模型产品

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1496554A (zh) * 2001-02-26 2004-05-12 ���µ�����ҵ��ʽ���� 声音个性化的语音合成器
US7277855B1 (en) * 2000-06-30 2007-10-02 At&T Corp. Personalized text-to-speech services
CN103065620A (zh) * 2012-12-27 2013-04-24 安徽科大讯飞信息科技股份有限公司 在手机上或网页上接收用户输入的文字并实时合成为个性化声音的方法
CN104123932A (zh) * 2014-07-29 2014-10-29 科大讯飞股份有限公司 一种语音转换系统及方法
CN105185372A (zh) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 个性化多声学模型的训练方法、语音合成方法及装置
CN106205602A (zh) * 2015-05-06 2016-12-07 上海汽车集团股份有限公司 语音播放方法和系统
CN109346083A (zh) * 2018-11-28 2019-02-15 北京猎户星空科技有限公司 一种智能语音交互方法及装置、相关设备及存储介质
CN109785823A (zh) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 语音合成方法及系统
CN110223705A (zh) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 语音转换方法、装置、设备及可读存储介质
CN110767210A (zh) * 2019-10-30 2020-02-07 四川长虹电器股份有限公司 一种生成个性化语音的方法及装置

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101312038B (zh) * 2007-05-25 2012-01-04 纽昂斯通讯公司 用于合成语音的方法
EP3151239A1 (en) * 2015-09-29 2017-04-05 Yandex Europe AG Method and system for text-to-speech synthesis
CN107644637B (zh) * 2017-03-13 2018-09-25 平安科技(深圳)有限公司 语音合成方法和装置
CN107767879A (zh) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 基于音色的音频转换方法及装置
CN107945786B (zh) * 2017-11-27 2021-05-25 北京百度网讯科技有限公司 语音合成方法和装置
US10811000B2 (en) * 2018-04-13 2020-10-20 Mitsubishi Electric Research Laboratories, Inc. Methods and systems for recognizing simultaneous speech by multiple speakers
CN109036377A (zh) * 2018-07-26 2018-12-18 中国银联股份有限公司 一种语音合成方法及装置
CN109147758B (zh) * 2018-09-12 2020-02-14 科大讯飞股份有限公司 一种说话人声音转换方法及装置
CN109859736B (zh) * 2019-01-23 2021-05-25 北京光年无限科技有限公司 语音合成方法及系统
CN109887484B (zh) * 2019-02-22 2023-08-04 平安科技(深圳)有限公司 一种基于对偶学习的语音识别与语音合成方法及装置
CN109767752B (zh) * 2019-02-27 2023-05-26 平安科技(深圳)有限公司 一种基于注意力机制的语音合成方法及装置
CN110136692B (zh) * 2019-04-30 2021-12-14 北京小米移动软件有限公司 语音合成方法、装置、设备及存储介质
CN110211564A (zh) * 2019-05-29 2019-09-06 泰康保险集团股份有限公司 语音合成方法及装置、电子设备和计算机可读介质
CN110288972B (zh) * 2019-08-07 2021-08-13 北京新唐思创教育科技有限公司 语音合成模型训练方法、语音合成方法及装置
CN110600045A (zh) * 2019-08-14 2019-12-20 科大讯飞股份有限公司 声音转换方法及相关产品
CN110807093A (zh) * 2019-10-30 2020-02-18 中国联合网络通信集团有限公司 语音处理方法、装置及终端设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115440186A (zh) * 2022-09-06 2022-12-06 云知声智能科技股份有限公司 一种音频特征信息生成方法、装置、设备和存储介质
CN115499396A (zh) * 2022-11-16 2022-12-20 北京红棉小冰科技有限公司 具有人格特征的信息生成方法及装置
CN115499396B (zh) * 2022-11-16 2023-04-07 北京红棉小冰科技有限公司 具有人格特征的信息生成方法及装置

Also Published As

Publication number Publication date
CN113314096A (zh) 2021-08-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21759516

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21759516

Country of ref document: EP

Kind code of ref document: A1