WO2020114323A1 - Method and apparatus for customized speech synthesis - Google Patents

Publication number
WO2020114323A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
user
audio file
recorded text
speech synthesis
Prior art date
Application number
PCT/CN2019/121852
Other languages
French (fr)
Chinese (zh)
Inventor
孙尧
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2020114323A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed are a method and apparatus for customized speech synthesis. The method comprises: receiving a TTS model generation request input by a user, wherein the TTS model generation request comprises a target field identifier (102); sending to the user a target record text corresponding to the target field identifier and receiving an audio file corresponding to the target record text and returned by the user, wherein the audio file is obtained by the user who performs recording according to the target record text (104); and according to the audio file, generating for the user a target TTS model corresponding to the target field identifier, wherein the target TTS model is used for providing a customized speech synthesis service having a pronunciation feature of the user (106).

Description

Method and apparatus for personalized speech synthesis
This application claims priority to Chinese patent application No. 201811489961.8, entitled "A method and apparatus for personalized speech synthesis" and filed on December 6, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of computer technology, and in particular to a method and apparatus for personalized speech synthesis.
Background
Speech synthesis, also known as text-to-speech (TTS), converts text into speech output. Conventionally, a large amount of speech data is first collected; a TTS model is then generated from the collected data; finally, the TTS model is used to convert text into speech output. Because the conventional approach requires collecting a large amount of speech data, building a TTS model is relatively complicated.
Therefore, a method for personalized speech synthesis that is easier to implement is needed.
Summary of the invention
The embodiments of this specification provide a method and apparatus for personalized speech synthesis that simplify the process of generating a TTS model.
In a first aspect, an embodiment of this specification provides a method for personalized speech synthesis, including:
receiving a TTS model generation request input by a user, where the TTS model generation request includes a target domain identifier;
sending to the user a target recorded text corresponding to the target domain identifier, and receiving an audio file that is returned by the user and corresponds to the target recorded text, where the audio file is recorded by the user according to the target recorded text; and
generating, according to the audio file, a target TTS model corresponding to the target domain identifier for the user, where the target TTS model is used to provide a personalized speech synthesis service having the user's pronunciation characteristics.
In a second aspect, an embodiment of this specification further provides an apparatus for personalized speech synthesis, configured to perform the method for personalized speech synthesis described in the first aspect, the apparatus including:
a receiving module, which receives a TTS model generation request input by a user, where the TTS model generation request includes a target domain identifier;
a sending module, which sends to the user a target recorded text corresponding to the target domain identifier;
the receiving module, which further receives an audio file that is returned by the user and corresponds to the target recorded text, where the audio file is recorded by the user according to the target recorded text; and
a TTS model generation module, which generates, according to the audio file, a target TTS model corresponding to the target domain identifier for the user, where the target TTS model is used to provide a personalized speech synthesis service having the user's pronunciation characteristics.
In a third aspect, an embodiment of this specification further provides an electronic device, including:
a memory, which stores a program; and
a processor, which executes the program stored in the memory and specifically performs the method for personalized speech synthesis described in the first aspect.
In a fourth aspect, an embodiment of this specification further provides a computer-readable storage medium storing one or more programs that, when executed by an electronic device including multiple application programs, cause the electronic device to perform the method for personalized speech synthesis described in the first aspect.
At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
A TTS model generation request that is input by a user and includes a target domain identifier is received; a target recorded text corresponding to the target domain identifier is sent to the user; and an audio file that is returned by the user and corresponds to the target recorded text is received, the audio file being recorded by the user according to the target recorded text. A target TTS model corresponding to the target domain identifier is then generated for the user according to the audio file, and is used to provide a personalized speech synthesis service having the user's pronunciation characteristics. This simplifies the generation of the TTS model and reduces the cost of the personalized speech synthesis service.
Brief description of the drawings
The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:
FIG. 1 is a schematic flowchart of a method for personalized speech synthesis provided by an embodiment of this specification;
FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of this specification;
FIG. 3 is a schematic structural diagram of an apparatus for personalized speech synthesis provided by an embodiment of this specification.
Detailed description
The technical solutions of the present application are described clearly and completely below with reference to specific embodiments of this specification and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without creative effort fall within the scope of protection of the present application.
The technical solutions provided by the embodiments of this specification are described in detail below with reference to the drawings.
FIG. 1 is a schematic flowchart of a method for personalized speech synthesis provided by an embodiment of this specification. The method may proceed as follows.
Step 102: receive a TTS model generation request input by a user, where the TTS model generation request includes a target domain identifier.
Step 104: send to the user a target recorded text corresponding to the target domain identifier, and receive an audio file that is returned by the user and corresponds to the target recorded text, where the audio file is recorded by the user according to the target recorded text.
Step 106: generate, according to the audio file, a target TTS model corresponding to the target domain identifier for the user, where the target TTS model is used to provide a personalized speech synthesis service having the user's pronunciation characteristics.
Sending to the user the target recorded text corresponding to the target domain identifier includes:
determining a recorded text database, which contains the recorded texts corresponding to different domain identifiers;
determining, from the recorded text database, the target recorded text corresponding to the target domain identifier; and
sending the target recorded text to the user.
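The lookup described above can be sketched as a simple keyed table. This is only an illustration under stated assumptions: the domain identifier strings, the request field name `target_domain_id`, and the placeholder texts are all hypothetical, and a real recorded text database would hold the full recorded text for each domain.

```python
from typing import Dict

# Hypothetical domain identifiers mapped to placeholder recorded texts.
RECORDED_TEXT_DB: Dict[str, str] = {
    "children_story": "Once upon a time, a little rabbit lived by the river.",
    "traffic": "Traffic on the ring road is slow because of road works.",
    "social_news": "The city opened a new public library this morning.",
    "weather_forecast": "Tomorrow will be cloudy with light rain in the evening.",
}


def handle_model_generation_request(request: Dict[str, str]) -> str:
    """Return the target recorded text for the domain named in the request.

    The request is assumed to carry the target domain identifier under the
    hypothetical key "target_domain_id".
    """
    domain_id = request["target_domain_id"]
    if domain_id not in RECORDED_TEXT_DB:
        raise KeyError(f"unknown domain identifier: {domain_id}")
    return RECORDED_TEXT_DB[domain_id]
```

Rejecting unknown identifiers explicitly keeps a malformed request from silently receiving the wrong text.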
Specifically, the recorded text database is obtained as follows:
determining different domain identifiers, each of which corresponds to one domain; and
generating, according to a preset algorithm, a recorded text corresponding to each domain identifier, where the recorded text includes characters and/or words that are common in the domain corresponding to that domain identifier.
The domain identifier includes at least one of the following:
a children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.
The personalized speech synthesis system determines the different domains of daily life based on common knowledge, for example the children's story domain, the traffic domain, the social news domain, and the weather forecast domain. Each domain corresponds to one domain identifier; for example, the children's story domain corresponds to the children's story domain identifier, the traffic domain to the traffic domain identifier, the social news domain to the social news domain identifier, the weather forecast domain to the weather forecast domain identifier, and so on.
According to the preset algorithm, an optimal recorded text is generated for each domain, that is, the recorded text corresponding to that domain's identifier. The recorded text for a domain includes characters and/or words that are common in that domain.
For example, according to the preset algorithm, an optimal recorded text is generated for the children's story domain, and that recorded text includes characters and/or words common in the children's story domain.
It should be noted that the preset algorithm can be determined according to the actual situation and is not specifically limited here.
The optimal recorded text for a domain contains the main Chinese syllables corresponding to the characters and/or words common in that domain, and avoids repetition as far as possible so as to reduce the amount of recorded text.
At a normal speaking rate, the audio file corresponding to the optimal recorded text for a domain is kept within a preset duration (for example, 20 to 60 minutes) as far as possible, so that the audio file can be obtained more quickly.
In addition, because the optimal recorded text for a domain must cover the characters and/or words common in that domain, the recorded text need not have a complete storyline.
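The specification leaves the preset algorithm open. One possible reading of "cover the main syllables while avoiding repetition" is a greedy set-cover selection over candidate sentences, sketched below. This is an assumption for illustration, not the algorithm the application uses: the function names, the per-sentence syllable sets (assumed to be annotated in advance), and the default speaking rate of 240 characters per minute are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def select_recording_sentences(
    candidates: Dict[str, Set[str]],
    target_syllables: Set[str],
) -> Tuple[List[str], Set[str]]:
    """Greedily pick sentences until no candidate adds a new target syllable.

    candidates maps each sentence to the set of syllables it contains.
    Returns the chosen sentences and the target syllables left uncovered.
    """
    remaining = set(target_syllables)
    chosen: List[str] = []
    while remaining and candidates:
        # Pick the sentence covering the most still-missing syllables;
        # iterating over sorted keys makes ties deterministic.
        best = max(sorted(candidates), key=lambda s: len(candidates[s] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # no candidate covers any of the remaining syllables
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


def estimated_minutes(sentences: List[str], chars_per_minute: float = 240.0) -> float:
    """Rough reading time, assuming a constant (hypothetical) speaking rate."""
    return sum(len(s) for s in sentences) / chars_per_minute
```

The estimate from `estimated_minutes` could then be compared against the preset duration range (for example, 20 to 60 minutes) before the text is stored in the database.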
When a user needs to build a TTS model, the user can log in to the application (hereinafter "APP") of the personalized speech synthesis system on a smart terminal and select the target domain identifier in the application, so that the personalized speech synthesis system receives a TTS model generation request that includes the target domain identifier.
The personalized speech synthesis system looks up the target recorded text corresponding to the target domain identifier in the recorded text database and sends it to the APP on the user's smart terminal.
After receiving the target recorded text, the user can record, in a quiet environment and using the user's own smart terminal, the audio file corresponding to the target recorded text, and then send the recorded audio file to the private cloud TTS storage and modeling space of the personalized speech synthesis system.
In the embodiments of this specification, generating, according to the audio file, the target TTS model corresponding to the target domain identifier for the user includes:
preprocessing the audio file to obtain a processed audio file;
determining, from the processed audio file, feature parameters that match the user's pronunciation characteristics; and
generating the target TTS model according to the feature parameters.
The feature parameters include at least one of the following:
pitch, timbre, speaking rate, pauses, and accent.
Preprocessing the audio file includes at least one of the following steps:
performing noise reduction on the audio file; and
determining, by automatic speech recognition, whether the audio file is correct.
In the private cloud TTS storage and modeling space of the personalized speech synthesis system, the TTS model generation module first performs noise reduction on the audio file corresponding to the target recorded text, then uses automatic speech recognition (ASR) to convert the denoised audio file into a text file, and then matches that text file against the target recorded text to determine whether the audio file is correct. If the audio file is correct, the processed audio file is obtained.
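The matching step can be illustrated with a small text comparison. The sketch below assumes the ASR transcript is already available as a string (no particular ASR engine is implied), and it uses a similarity ratio from Python's standard `difflib` with a hypothetical threshold of 0.9; the specification does not say how the match is scored.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Drop punctuation and whitespace and lowercase the rest,
    so that only the spoken content is compared."""
    kept = [
        ch.lower()
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept)


def audio_matches_text(asr_transcript: str, target_text: str, threshold: float = 0.9) -> bool:
    """Return True when the transcript is close enough to the target recorded text.

    The threshold is an illustrative assumption, not a value from the source.
    """
    matcher = difflib.SequenceMatcher(
        None, normalize(asr_transcript), normalize(target_text), autojunk=False
    )
    return matcher.ratio() >= threshold
```

A tolerance below 1.0 allows for minor recognition errors; an audio file that fails the check could be rejected so that the user records it again.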
Personalized TTS modeling is performed on the processed audio file to obtain the feature parameters closest to it, that is, the feature parameters that match the user's pronunciation characteristics. The feature parameters include, but are not limited to, pitch, timbre, speaking rate, pauses, and accent.
Based on the feature parameters that match the user's pronunciation characteristics, a target TTS model is generated that can provide, within the domain corresponding to the target domain identifier, a personalized speech synthesis service having the user's pronunciation characteristics.
The audio file is obtained by recording the target recorded text on the user's own smart terminal, and the target TTS model is then generated from that audio file. This effectively simplifies the generation of the TTS model and, compared with recording audio files in a studio as in the prior art, greatly reduces recording costs.
For the generated target TTS model, the personalized speech synthesis system provides a cloud service; that is, the target TTS model can be invoked by a smart terminal authorized by the user.
The embodiments of this specification further include:
receiving a voice broadcast request that includes authorization information corresponding to the user; and
providing the personalized speech synthesis service using the target TTS model according to the voice broadcast request.
The personalized speech synthesis service includes at least one of the following:
telling stories, broadcasting the weather forecast, announcing the time, and broadcasting news.
The voice broadcast request comes from the user who sent the TTS model generation request, or from another user authorized by that user.
When the personalized speech synthesis system receives a voice broadcast request containing authorization information corresponding to the user, it can invoke that user's target TTS model stored in the cloud and provide the personalized speech synthesis service according to that model.
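The authorization check on a voice broadcast request might look like the following sketch. Everything here is hypothetical: the `StoredModel` type, the token-based form of the authorization information, and the `synthesize` callable, which stands in for real speech synthesis and simply returns a string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class StoredModel:
    """A user's target TTS model as kept in cloud storage (illustrative only)."""
    owner: str
    domain_id: str
    synthesize: Callable[[str], str]  # stand-in for real speech synthesis
    authorized_tokens: Set[str] = field(default_factory=set)


def handle_broadcast_request(
    models: Dict[str, StoredModel], owner: str, token: str, text: str
) -> str:
    """Invoke the owner's model only if the request carries a valid authorization."""
    model = models.get(owner)
    if model is None:
        raise LookupError(f"no target TTS model stored for {owner}")
    if token not in model.authorized_tokens:
        raise PermissionError("request is not authorized by the model owner")
    return model.synthesize(text)
```

Keeping the set of authorized tokens on the model record means the owner can grant or revoke access per device without touching the model itself.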
In one embodiment, the personalized speech synthesis system generates for user A a target TTS model corresponding to the children's story domain identifier. When user A is at work and cannot be with his child, the child can access the cloud service of the personalized speech synthesis system through a smart device at home and ask, "Daddy, tell me a Peppa Pig story." The private cloud server of the personalized speech synthesis system recognizes that the request comes from user A's child, who has been authorized by user A, and can address the child by nickname, for example, "Doudou, Daddy will tell you a story." The Peppa Pig story can then be told in user A's voice as generated by the target TTS model (the children's story itself comes from the public cloud server corresponding to the smart device).
In another embodiment, the personalized speech synthesis system generates for user B a target TTS model corresponding to the weather forecast domain identifier. When user B's parents, who live in the countryside, use a smart device at home authorized by user B (for example, one logged in to user B's account) to access the cloud service of the personalized speech synthesis system and check the weather, the weather can be broadcast in user B's voice as generated by the target TTS model, reminding them to watch for changes in the weather and letting them feel the warmth of family affection.
In another embodiment, after the personalized speech synthesis system has generated a target TTS model for user C, even if user C passes away, user C's relatives can still use a smart device authorized by user C (for example, one logged in to user C's account) to access the cloud service of the personalized speech synthesis system and have the weather, stories, news, jokes, and so on delivered in user C's voice as generated by the target TTS model, so that they can still feel user C's companionship.
In the embodiments of this specification, when the domain of a received voice broadcast request does not match the target domain identifier of the target TTS model, continuing to use the target TTS model for the personalized speech synthesis service would result in poor broadcast quality. In that case, the full-domain TTS model stored on the public cloud server can be invoked to provide the user with a better speech synthesis service.
The full-domain TTS model stored on the public cloud server may be built by collecting a large amount of speech data as in the prior art, or in some other way; this is not specifically limited here.
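The fallback rule can be expressed in a few lines. In this sketch the models are represented as plain callables and the personal models are keyed by hypothetical domain identifier strings; both choices are assumptions made only to keep the example self-contained.

```python
from typing import Callable, Dict

Synthesizer = Callable[[str], str]  # stand-in type for a TTS model


def select_synthesizer(
    request_domain_id: str,
    personal_models: Dict[str, Synthesizer],
    full_domain_model: Synthesizer,
) -> Synthesizer:
    """Use the personal model for the requested domain if one exists;
    otherwise fall back to the full-domain model."""
    return personal_models.get(request_domain_id, full_domain_model)
```

With this shape, a request in a domain the user never recorded for is served by the full-domain model rather than by a mismatched personal model.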
In the technical solution described in the embodiments of this specification, a TTS model generation request that is input by a user and includes a target domain identifier is received; a target recorded text corresponding to the target domain identifier is sent to the user; and an audio file that is returned by the user and corresponds to the target recorded text is received, the audio file being recorded by the user according to the target recorded text. A target TTS model corresponding to the target domain identifier is then generated for the user according to the audio file, and is used to provide a personalized speech synthesis service having the user's pronunciation characteristics. This simplifies the generation of the TTS model and reduces the cost of the personalized speech synthesis service.
FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of this specification. As shown in FIG. 2, at the hardware level the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk storage device. The electronic device may, of course, also include hardware required by other services.
The processor, the network interface, and the memory may be connected to one another through the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one bidirectional arrow is shown in FIG. 2, but this does not mean that there is only one bus or one type of bus.
The memory stores a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and runs it, forming the apparatus for personalized speech synthesis at the logical level. The processor executes the program stored in the memory and specifically performs the steps of the method embodiment shown in FIG. 1.
The method shown in FIG. 1 may be applied to, or implemented by, the processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this specification. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this specification may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium well established in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The electronic device can perform the method of the method embodiment shown in FIG. 1 and implement the functions of that embodiment, which are not repeated here.
An embodiment of this specification further provides a computer-readable storage medium storing one or more programs. The one or more programs include instructions that, when executed by an electronic device including multiple application programs, cause the electronic device to perform the method for personalized speech synthesis in the embodiment shown in FIG. 1, and specifically to perform the steps of that method embodiment.
FIG. 3 is a schematic structural diagram of an apparatus for personalized speech synthesis provided by an embodiment of this specification. The apparatus 300 shown in FIG. 3 may be used to perform the method of the embodiment shown in FIG. 1. The apparatus 300 includes:
a receiving module 301, which receives a TTS model generation request input by a user, the TTS model generation request including a target domain identifier;
a sending module 302, which sends the user a target recorded text corresponding to the target domain identifier;
the receiving module 301, which also receives an audio file returned by the user and corresponding to the target recorded text, the audio file being recorded by the user according to the target recorded text;
a TTS model generation module 303, which generates, according to the audio file, a target TTS model for the user corresponding to the target domain identifier, the target TTS model being used to provide a personalized speech synthesis service with the user's pronunciation characteristics.
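The interaction between modules 301, 302, and 303 can be sketched as follows. This is a minimal illustration, not the implementation described in the application: the class and field names (`TTSModelRequest`, `TTSModel`, `Apparatus300`, `audio_bytes`) are hypothetical, the recorded text database is an in-memory dictionary, and "generating" a model is reduced to recording the size of the uploaded audio.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class TTSModelRequest:
    """Hypothetical shape of the TTS model generation request."""
    user_id: str
    target_domain_id: str  # the "target domain identifier"


@dataclass
class TTSModel:
    """Placeholder for a per-user, per-domain target TTS model."""
    user_id: str
    domain_id: str
    audio_bytes: int  # stand-in for whatever a real model would store


@dataclass
class Apparatus300:
    # Assumed in-memory stand-in for the recorded text database.
    recorded_texts: Dict[str, str]
    models: Dict[Tuple[str, str], TTSModel] = field(default_factory=dict)

    def handle_request(self, request: TTSModelRequest) -> str:
        """Modules 301 and 302: accept the request, return the target recorded text."""
        return self.recorded_texts[request.target_domain_id]

    def handle_audio(self, request: TTSModelRequest, audio: bytes) -> TTSModel:
        """Modules 301 and 303: accept the recording, produce the target TTS model."""
        model = TTSModel(request.user_id, request.target_domain_id, len(audio))
        self.models[(request.user_id, request.target_domain_id)] = model
        return model
```

Keying the stored model on both the user and the domain identifier mirrors the text's statement that the target TTS model is generated for the user and corresponds to the target domain identifier.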
Optionally, the sending module 302 further includes:
a first determining unit, which determines a recorded text database, the recorded text database including recorded texts corresponding to different domain identifiers;
a second determining unit, which determines, according to the recorded text database, the target recorded text corresponding to the target domain identifier;
a sending unit, which sends the target recorded text to the user.
Optionally, the recorded text database is obtained in the following manner:
determining different domain identifiers, where each of the different domain identifiers corresponds to one domain;
generating, according to a preset algorithm, the recorded text corresponding to each domain identifier, where the recorded text corresponding to a domain identifier includes characters and/or words that are common in the domain corresponding to that domain identifier.
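The text does not specify the "preset algorithm". As one possible sketch under stated assumptions, the example below counts word frequencies in a per-domain corpus and greedily picks sentences that cover the most common words. The function names, the `top_n` and `max_sentences` parameters, and the whitespace tokenization are all assumptions made for illustration; Chinese text would need character-level or segmenter-based tokenization instead.

```python
from collections import Counter
from typing import Dict, List, Set


def _words(sentence: str) -> Set[str]:
    # Assumption: whitespace tokenization is good enough for the illustration.
    return set(sentence.lower().split())


def select_recorded_text(corpus: List[str], top_n: int = 5,
                         max_sentences: int = 3) -> List[str]:
    """Greedily choose sentences that cover the top_n most frequent words."""
    counts = Counter(w for s in corpus for w in s.lower().split())
    uncovered = {w for w, _ in counts.most_common(top_n)}
    chosen: List[str] = []
    while uncovered and len(chosen) < max_sentences:
        candidates = [s for s in corpus if s not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda s: len(uncovered & _words(s)))
        if not uncovered & _words(best):
            break  # no remaining sentence adds coverage
        chosen.append(best)
        uncovered -= _words(best)
    return chosen


def build_recorded_text_database(corpora: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each domain identifier to its generated recorded text."""
    return {domain_id: select_recorded_text(c) for domain_id, c in corpora.items()}
```

The greedy coverage step is only one way to ensure the recorded text contains words common in the domain; a production system would more likely also balance phonetic coverage.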
Optionally, the domain identifier includes at least one of the following:
a children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.
Optionally, the TTS model generation module 303 further includes:
a preprocessing unit, which preprocesses the audio file to obtain a processed audio file;
a third determining unit, which determines, according to the processed audio file, feature parameters that match the user's pronunciation characteristics;
a generating unit, which generates the target TTS model according to the feature parameters.
Optionally, the feature parameters include at least one of the following:
pitch, timbre, speech rate, pauses, and accent.
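To make two of these feature parameters concrete, the sketch below estimates speech rate and mean pause length from word timings. It assumes, purely for illustration, that a forced aligner has already produced `(start, end)` times in seconds for each word of the recording; the names `FeatureParameters` and `estimate_rate_and_pause` are hypothetical, and pitch, timbre, and accent are not covered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeatureParameters:
    speech_rate_wps: float  # words per second over the whole utterance
    mean_pause_s: float     # average silent gap between consecutive words


def estimate_rate_and_pause(word_spans: List[Tuple[float, float]]) -> FeatureParameters:
    """Derive speech rate and mean pause from (start, end) word timings."""
    if not word_spans:
        raise ValueError("word_spans must not be empty")
    total = word_spans[-1][1] - word_spans[0][0]
    # Gap between the end of one word and the start of the next, never negative.
    pauses = [max(0.0, nxt[0] - cur[1])
              for cur, nxt in zip(word_spans, word_spans[1:])]
    rate = len(word_spans) / total if total > 0 else 0.0
    mean_pause = sum(pauses) / len(pauses) if pauses else 0.0
    return FeatureParameters(rate, mean_pause)
```

For three words spanning two seconds with one half-second gap, the function reports 1.5 words per second and a mean pause of 0.25 seconds.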
Optionally, the preprocessing unit is specifically configured to:
perform noise reduction on the audio file;
determine, by means of automatic language recognition technology, whether the audio file is correct.
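The two preprocessing steps can be sketched as below. Both parts are simplifying assumptions rather than the method of the application: noise reduction is reduced to a basic amplitude gate, and "correct" is interpreted as the recognized transcript closely matching the target recorded text. The recognizer is passed in as a callable because the text names no specific recognition engine; `noise_gate`, `audio_matches_text`, `threshold`, and `min_ratio` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def noise_gate(samples: List[float], threshold: float = 0.02) -> List[float]:
    """Zero out samples whose amplitude is below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def audio_matches_text(samples: List[float], target_text: str,
                       recognize: Callable[[List[float]], str],
                       min_ratio: float = 0.9) -> bool:
    """Accept the recording if its transcript is close enough to the target text."""
    transcript = recognize(samples)
    ratio = SequenceMatcher(None, transcript.lower(), target_text.lower()).ratio()
    return ratio >= min_ratio
```

A similarity threshold rather than exact equality tolerates small recognition errors; the value 0.9 is arbitrary and would need tuning.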
Optionally, the apparatus 300 further includes:
the receiving module 301, which receives a voice broadcast request, the voice broadcast request including authorization information corresponding to the user;
a service module, which uses the target TTS model to provide a personalized speech synthesis service according to the voice broadcast request.
Optionally, the personalized speech synthesis service includes at least one of the following:
telling stories, broadcasting weather forecasts, announcing the time, and broadcasting news.
Optionally, the voice broadcast request comes from the user or from another user authorized by the user.
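The authorization check implied here can be sketched as follows. The text only says that the request carries authorization information corresponding to the user; representing that as a requester ID checked against a per-owner grant set is an assumption, and `VoiceBroadcastRequest`, `SynthesisService`, and the string returned by `handle` are hypothetical placeholders for a real synthesis call.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VoiceBroadcastRequest:
    requester_id: str  # who is asking for the broadcast
    owner_id: str      # whose target TTS model should be used
    text: str


@dataclass
class SynthesisService:
    # owner_id -> set of other users the owner has authorized (assumed structure)
    grants: Dict[str, Set[str]] = field(default_factory=dict)

    def is_authorized(self, req: VoiceBroadcastRequest) -> bool:
        """The owner may always use the model; others need an explicit grant."""
        return (req.requester_id == req.owner_id
                or req.requester_id in self.grants.get(req.owner_id, set()))

    def handle(self, req: VoiceBroadcastRequest) -> str:
        if not self.is_authorized(req):
            raise PermissionError("requester is not authorized to use this voice")
        # Placeholder for running the owner's target TTS model on req.text.
        return f"[voice of {req.owner_id}] {req.text}"
```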
With the apparatus for personalized speech synthesis, the receiving module receives a TTS model generation request input by a user, the request including a target domain identifier; the sending module sends the user a target recorded text corresponding to the target domain identifier; the receiving module receives an audio file returned by the user and corresponding to the target recorded text, the audio file being recorded by the user according to the target recorded text; and the TTS model generation module generates, according to the audio file, a target TTS model for the user corresponding to the target domain identifier, the target TTS model being used to provide a personalized speech synthesis service with the user's pronunciation characteristics. This simplifies the generation of the TTS model and reduces the cost of the personalized speech synthesis service.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). With the development of technology, however, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be implemented with a hardware entity module. For example, a programmable logic device (PLD), such as a field-programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program the device themselves to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, instead of making integrated circuit chips by hand, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art will also appreciate that a hardware circuit implementing a logical method flow can readily be obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
A controller may be implemented in any suitable manner. For example, a controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for implementing various functions can be regarded as structures within the hardware component. Indeed, the means for implementing various functions can be regarded both as software modules that implement the method and as structures within the hardware component.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of separate units divided by function. When this application is implemented, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory in computer-readable media, in the form of random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.
This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for related details, refer to the corresponding description of the method embodiment.
The above are merely embodiments of this application and are not intended to limit it. Those skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of its claims.

Claims (13)

  1. A method for personalized speech synthesis, comprising:
    receiving a speech synthesis (TTS) model generation request input by a user, the TTS model generation request including a target domain identifier;
    sending to the user a target recorded text corresponding to the target domain identifier, and receiving an audio file returned by the user and corresponding to the target recorded text, the audio file being recorded by the user according to the target recorded text;
    generating, according to the audio file, a target TTS model for the user corresponding to the target domain identifier, the target TTS model being used to provide a personalized speech synthesis service with the user's pronunciation characteristics.
  2. The method of claim 1, wherein sending to the user the target recorded text corresponding to the target domain identifier comprises:
    determining a recorded text database, the recorded text database including recorded texts corresponding to different domain identifiers;
    determining, according to the recorded text database, the target recorded text corresponding to the target domain identifier;
    sending the target recorded text to the user.
  3. The method of claim 2, wherein the recorded text database is obtained by:
    determining different domain identifiers, each of the different domain identifiers corresponding to one domain;
    generating, according to a preset algorithm, the recorded text corresponding to each domain identifier, wherein the recorded text corresponding to a domain identifier includes characters and/or words that are common in the domain corresponding to that domain identifier.
  4. The method of claim 3, wherein the domain identifier includes at least one of the following:
    a children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.
  5. The method of claim 1, wherein generating, according to the audio file, the target TTS model for the user corresponding to the target domain identifier comprises:
    preprocessing the audio file to obtain a processed audio file;
    determining, according to the processed audio file, feature parameters that match the user's pronunciation characteristics;
    generating the target TTS model according to the feature parameters.
  6. The method of claim 5, wherein the feature parameters include at least one of the following:
    pitch, timbre, speech rate, pauses, and accent.
  7. The method of claim 5, wherein preprocessing the audio file includes at least one of the following steps:
    performing noise reduction on the audio file;
    determining, by means of automatic language recognition technology, whether the audio file is correct.
  8. The method of claim 1, further comprising:
    receiving a voice broadcast request, the voice broadcast request including authorization information corresponding to the user;
    using the target TTS model to provide a personalized speech synthesis service according to the voice broadcast request.
  9. The method of claim 8, wherein the personalized speech synthesis service includes at least one of the following:
    telling stories, broadcasting weather forecasts, announcing the time, and broadcasting news.
  10. The method of claim 8, wherein the voice broadcast request comes from the user or from another user authorized by the user.
  11. An apparatus for personalized speech synthesis, configured to perform the method for personalized speech synthesis according to any one of claims 1-10, the apparatus comprising:
    a receiving module, which receives a TTS model generation request input by a user, the TTS model generation request including a target domain identifier;
    a sending module, which sends to the user a target recorded text corresponding to the target domain identifier;
    the receiving module, which also receives an audio file returned by the user and corresponding to the target recorded text, the audio file being recorded by the user according to the target recorded text;
    a TTS model generation module, which generates, according to the audio file, a target TTS model for the user corresponding to the target domain identifier, the target TTS model being used to provide a personalized speech synthesis service with the user's pronunciation characteristics.
  12. An electronic device, comprising:
    a memory, which stores a program;
    a processor, which executes the program stored in the memory and specifically performs the method for personalized speech synthesis according to any one of claims 1-10.
  13. A computer-readable storage medium storing one or more programs which, when executed by an electronic device that includes multiple application programs, cause the electronic device to perform the method for personalized speech synthesis according to any one of claims 1-10.
PCT/CN2019/121852 2018-12-06 2019-11-29 Method and apparatus for customized speech synthesis WO2020114323A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811489961.8A CN111369966A (en) 2018-12-06 2018-12-06 Method and device for personalized speech synthesis
CN201811489961.8 2018-12-06

Publications (1)

Publication Number Publication Date
WO2020114323A1 true WO2020114323A1 (en) 2020-06-11

Family

ID=70975185

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121852 WO2020114323A1 (en) 2018-12-06 2019-11-29 Method and apparatus for customized speech synthesis

Country Status (3)

Country Link
CN (1) CN111369966A (en)
TW (1) TW202025135A (en)
WO (1) WO2020114323A1 (en)

Also Published As

Publication number Publication date
TW202025135A (en) 2020-07-01
CN111369966A (en) 2020-07-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19893358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19893358

Country of ref document: EP

Kind code of ref document: A1