WO2020114323A1

WO2020114323A1 - Method and apparatus for customized speech synthesis

Info

Publication number: WO2020114323A1
Application number: PCT/CN2019/121852
Authority: WO
Inventors: 孙尧
Original assignee: 阿里巴巴集团控股有限公司
Priority date: 2018-12-06
Filing date: 2019-11-29
Publication date: 2020-06-11
Also published as: TW202025135A; CN111369966A

Abstract

Disclosed are a method and apparatus for customized speech synthesis. The method comprises: receiving a TTS model generation request input by a user, wherein the TTS model generation request comprises a target field identifier (102); sending to the user a target record text corresponding to the target field identifier and receiving an audio file corresponding to the target record text and returned by the user, wherein the audio file is obtained by the user who performs recording according to the target record text (104); and according to the audio file, generating for the user a target TTS model corresponding to the target field identifier, wherein the target TTS model is used for providing a customized speech synthesis service having a pronunciation feature of the user (106).

Description

Method and device for personalized speech synthesis

This application claims priority to Chinese patent application No. 201811489961.8, entitled "A method and device for personalized speech synthesis" and filed on December 6, 2018, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular to a method and device for personalized speech synthesis.

Background

Speech synthesis technology, also known as text-to-speech (TTS) technology, converts text information into speech output. Specifically, a large amount of voice data is first collected; a TTS model is then generated from the collected voice data; and finally the text information is converted into speech output according to the TTS model. Because the traditional process of building a TTS model requires collecting a large amount of voice data, building the model is relatively complicated.

Therefore, a more easily implemented method for personalized speech synthesis is needed.

Summary of the invention

The embodiments of the present specification provide a method and device for personalized speech synthesis, so that the generation process of the TTS model can be simplified.

In a first aspect, an embodiment of this specification provides a method for personalized speech synthesis, including:

Receiving a speech synthesis TTS model generation request input by a user, where the TTS model generation request includes a target domain identifier;

Sending a target recorded text corresponding to the target domain identifier to the user, and receiving an audio file corresponding to the target recorded text returned by the user, where the audio file is obtained by the user recording according to the target recorded text;

According to the audio file, a target TTS model corresponding to the target domain identifier is generated for the user, and the target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics.

In a second aspect, an embodiment of the present specification also provides an apparatus for personalized speech synthesis, for performing the method for personalized speech synthesis as described in the first aspect, the apparatus includes:

The receiving module receives a TTS model generation request input by a user, and the TTS model generation request includes a target domain identifier;

A sending module, sending the target recorded text corresponding to the target domain identifier to the user;

The receiving module further receives an audio file corresponding to the target recorded text returned by the user, where the audio file is obtained by the user recording according to the target recorded text;

The TTS model generation module generates a target TTS model corresponding to the target domain identifier for the user according to the audio file, and the target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics.

In a third aspect, an embodiment of this specification also provides an electronic device, including:

A memory that stores a program;

A processor that executes the program stored in the memory, and specifically executes the method for personalized speech synthesis as described in the first aspect.

In a fourth aspect, an embodiment of this specification also provides a computer-readable storage medium that stores one or more programs, where the one or more programs, when executed by an electronic device including multiple application programs, cause the electronic device to execute the method for personalized speech synthesis as described in the first aspect.

The at least one technical solution adopted in the embodiments of the present application can achieve the following beneficial effects:

A TTS model generation request that is input by the user and includes the target domain identifier is received; the target recorded text corresponding to the target domain identifier is sent to the user; and the audio file corresponding to the target recorded text returned by the user is received, the audio file being obtained by the user recording according to the target recorded text. According to the audio file, a target TTS model corresponding to the target domain identifier is generated for the user. The target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics, which simplifies the generation process of the TTS model and reduces the cost of the personalized speech synthesis service.

Brief description of the drawings

The drawings described herein are used to provide a further understanding of the present application and form a part of the present application. The schematic embodiments and descriptions of the present application are used to explain the present application and do not constitute an undue limitation on the present application. In the drawings:

FIG. 1 is a schematic flowchart of a method for personalized speech synthesis provided by an embodiment of the present specification;

FIG. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present specification;

FIG. 3 is a schematic structural diagram of an apparatus for personalized speech synthesis provided by an embodiment of the present specification.

Detailed description

The technical solution of the present application will be described clearly and completely with reference to specific embodiments of the present specification and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all the embodiments. Based on the embodiments in this description, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the present application.

The technical solutions provided by the embodiments of this specification are described in detail below in conjunction with the drawings.

FIG. 1 is a schematic flowchart of a method for personalized speech synthesis provided by an embodiment of the present specification. The method may be as follows.

Step 102: Receive a TTS model generation request input by the user, and the TTS model generation request includes the target domain identifier.

Step 104: Send the target recorded text corresponding to the target domain identifier to the user, and receive the audio file corresponding to the target recorded text returned by the user. The audio file is recorded by the user according to the target recorded text.

Step 106: According to the audio file, generate a target TTS model corresponding to the target domain identifier for the user. The target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics.

Sending the target recorded text corresponding to the target domain identifier to the user includes:

Determine the recorded text database, which includes the recorded texts corresponding to different domain identifiers;

According to the recorded text database, determine the target recorded text corresponding to the target domain identifier;

Send the target recorded text to the user.
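The lookup described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the request type `TTSModelRequest`, the in-memory dictionary `RECORDED_TEXT_DB`, the domain identifier strings, and the sample sentences are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the recorded text database:
# one recording script per domain identifier.
RECORDED_TEXT_DB = {
    "children_story": "Once upon a time, a little rabbit lived by the river.",
    "weather_forecast": "Tomorrow will be cloudy with light rain in the afternoon.",
}


@dataclass(frozen=True)
class TTSModelRequest:
    """Hypothetical shape of a TTS model generation request."""
    user_id: str
    target_domain_id: str


def target_recorded_text(request: TTSModelRequest,
                         db: dict[str, str] = RECORDED_TEXT_DB) -> str:
    """Return the recording script for the requested domain identifier."""
    try:
        return db[request.target_domain_id]
    except KeyError:
        # The request names a domain for which no script was prepared.
        raise ValueError(
            f"unknown domain identifier: {request.target_domain_id!r}"
        ) from None
```

Rejecting an unknown identifier explicitly is a design choice of this sketch; the specification does not say how such a request is handled.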

Specifically, the recorded text database is obtained in the following manner:

Determine different domain identifiers, where each domain identifier corresponds to one domain;

According to a preset algorithm, generate the recorded text corresponding to each domain identifier, where the recorded text corresponding to a domain identifier includes characters and/or words commonly used in the domain corresponding to that domain identifier.

The domain identifier includes at least one of the following:

A children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.

The personalized speech synthesis system determines different domains of daily life based on common knowledge, for example, the children's story domain, the traffic domain, the social news domain, the weather forecast domain, etc. Each domain corresponds to one domain identifier: for example, the children's story domain corresponds to the children's story domain identifier, the traffic domain corresponds to the traffic domain identifier, the social news domain corresponds to the social news domain identifier, the weather forecast domain corresponds to the weather forecast domain identifier, etc.

According to the preset algorithm, the optimal recorded text corresponding to each domain is generated, that is, the recorded text corresponding to that domain's identifier. The recorded text corresponding to a domain includes characters and/or words that are common in the domain.

For example, according to the preset algorithm, an optimal recorded text corresponding to the children's story domain is generated, and this recorded text includes characters and/or words common in the children's story domain.

It should be noted that the preset algorithm can be determined according to actual conditions, and is not specifically limited here.

The optimal recorded text corresponding to a domain contains the main Chinese syllables corresponding to the common characters and/or words in that domain, and avoids repetition as far as possible so as to reduce the data volume of the recorded text.
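The specification leaves the preset algorithm open. One way to cover the required syllables with little repetition is a greedy set-cover heuristic, sketched below under that assumption. The function name, the sentence-to-syllable mapping, and the idea of choosing from a pool of candidate sentences are illustrative and do not come from the specification.

```python
def select_recording_sentences(candidates: dict[str, set[str]],
                               required: set[str]) -> list[str]:
    """Greedily pick sentences until every required syllable is covered.

    `candidates` maps each candidate sentence to the set of syllables it
    contains; `required` is the set of syllables the script must cover.
    """
    uncovered = set(required)
    chosen: list[str] = []
    while uncovered and candidates:
        # Pick the sentence that covers the most still-missing syllables.
        best = max(candidates, key=lambda s: len(candidates[s] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # the remaining syllables occur in no candidate sentence
        chosen.append(best)
        uncovered -= gained
    return chosen
```

A greedy choice does not guarantee the shortest possible script, but it keeps each added sentence useful, which matches the stated goal of avoiding repetition.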

For the optimal recorded text corresponding to a domain, the audio file recorded at a normal speaking rate should, as far as possible, stay within a preset duration (for example, 20 to 60 minutes), so that the audio file can be acquired more quickly.
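A rough duration check can be sketched as follows. The 20 to 60 minute bounds are the example values from the text; the default reading speed of 240 characters per minute and both function names are assumptions made for illustration only.

```python
def estimated_recording_minutes(text: str, chars_per_minute: float = 240.0) -> float:
    """Estimate read-aloud time from the number of non-whitespace characters.

    The default speaking rate is an assumed value, not one from the specification.
    """
    n_chars = sum(1 for ch in text if not ch.isspace())
    return n_chars / chars_per_minute


def fits_preset_duration(text: str,
                         min_minutes: float = 20.0,
                         max_minutes: float = 60.0,
                         chars_per_minute: float = 240.0) -> bool:
    """Check whether the estimated recording time lies in the preset range."""
    minutes = estimated_recording_minutes(text, chars_per_minute)
    return min_minutes <= minutes <= max_minutes
```

A script of 7,200 characters, for instance, is estimated at 30 minutes under the assumed rate and would pass the check.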

In addition, since the optimal recorded text corresponding to a domain needs to cover the common characters and/or words in that domain, the recorded text may not have a complete plot.

When a user needs to build a TTS model, the user can log in to the application (hereinafter referred to as the APP) corresponding to the personalized speech synthesis system on a smart terminal and select the target domain identifier in the APP, so that the personalized speech synthesis system receives a TTS model generation request containing the target domain identifier.

The personalized speech synthesis system finds the target recorded text corresponding to the target domain identification from the recorded text database, and sends the target recorded text to the APP in the user's smart terminal.

After receiving the target recorded text, the user can record the audio file corresponding to the target recorded text on his or her own smart terminal in a quiet environment, and then send the recorded audio file to the private cloud TTS storage and modeling space corresponding to the personalized speech synthesis system.

In the embodiment of the present specification, according to the audio file, a target TTS model corresponding to the target domain identification is generated for the user, including:

Pre-process audio files to get processed audio files;

According to the processed audio file, determine the characteristic parameters that match the user's pronunciation characteristics;

According to the characteristic parameters, the target TTS model is generated.

The characteristic parameters include at least one of the following:

Tone, timbre, speaking rate, pauses, and accent.

Preprocessing audio files includes at least one of the following steps:

Perform noise reduction processing on audio files;

Determine, through automatic speech recognition technology, whether the audio file is correct.

In the private cloud TTS storage and modeling space corresponding to the personalized speech synthesis system, the TTS model generation module first performs noise reduction on the audio file corresponding to the target recorded text, then uses automatic speech recognition (ASR) technology to convert the noise-reduced audio file into a text file, and then matches the text file against the target recorded text to determine whether the audio file is correct. If the audio file is correct, the processed audio file is obtained.
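The matching step can be illustrated with a short sketch. It assumes the ASR output is already available as a string and uses Python's standard `difflib` for the comparison; the similarity threshold of 0.9 and the function names are hypothetical, and the specification does not say how the match is scored.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation and whitespace so only spoken content is compared."""
    return re.sub(r"[\W_]+", "", text.lower())


def audio_matches_text(asr_transcript: str, target_text: str,
                       threshold: float = 0.9) -> bool:
    """Return True when the ASR transcript is close enough to the target recorded text."""
    matcher = difflib.SequenceMatcher(
        None, normalize(asr_transcript), normalize(target_text)
    )
    return matcher.ratio() >= threshold
```

Comparing with a tolerance rather than requiring exact equality allows for occasional ASR errors; a recording that fails the check could be returned to the user for re-recording, although the specification does not describe that step.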

Personalized TTS modeling is then performed on the processed audio file to obtain the characteristic parameters closest to the processed audio file, that is, the characteristic parameters that match the user's pronunciation characteristics. The characteristic parameters include but are not limited to tone, timbre, speaking rate, pauses, and accent.

Therefore, according to the characteristic parameters matching the user's pronunciation characteristics, a target TTS model can be generated that provides a personalized speech synthesis service with the user's pronunciation characteristics in the domain corresponding to the target domain identifier.
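Two of the listed characteristic parameters, speaking rate and pauses, can be estimated from word timings, as the sketch below shows. It assumes word-level start and end times are available (for example from an alignment step that the specification does not describe); the `TimedWord` type and the 0.3 second pause threshold are hypothetical. Tone, timbre, and accent need acoustic analysis and are not covered here.

```python
from dataclasses import dataclass


@dataclass
class TimedWord:
    """Hypothetical word-level timing, e.g. from an alignment step."""
    text: str
    start: float  # seconds
    end: float    # seconds


def speaking_rate_and_pauses(words: list[TimedWord],
                             pause_threshold: float = 0.3
                             ) -> tuple[float, list[float]]:
    """Return (words per minute, pause lengths in seconds at or above the threshold)."""
    if not words:
        return 0.0, []
    total_seconds = words[-1].end - words[0].start
    rate = len(words) / total_seconds * 60.0 if total_seconds > 0 else 0.0
    # A pause is the silent gap between the end of one word and the start of the next.
    pauses = [
        nxt.start - cur.end
        for cur, nxt in zip(words, words[1:])
        if nxt.start - cur.end >= pause_threshold
    ]
    return rate, pauses
```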

The audio file is obtained by the user recording the target recorded text on his or her own smart terminal, and the target TTS model is then generated from the audio file. This effectively simplifies the generation process of the TTS model and, compared with the prior art in which audio files are recorded in a studio, greatly reduces recording costs.

For the generated target TTS model, the personalized speech synthesis system provides cloud services, that is, the target TTS model can be called by an intelligent terminal authorized by the user.

The embodiment of this specification also includes:

Receive a voice broadcast request, which includes authorization information corresponding to the user;

According to the voice broadcast request, use the target TTS model to provide personalized voice synthesis services.

Among them, the personalized speech synthesis service includes at least one of the following:

Tell stories, broadcast weather forecasts, broadcast time, and broadcast news.

The voice broadcast request comes from the user who sent the TTS model generation request, or another user authorized by the user.

When the personalized speech synthesis system receives a voice broadcast request containing authorization information corresponding to the user, it can call the target TTS model corresponding to the user stored in the cloud, and then provide a personalized speech synthesis service according to the target TTS model.
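The authorization check can be sketched as follows, assuming for illustration that the authorization information in a voice broadcast request resolves to a requester identifier. The `StoredModel` type and its fields are hypothetical; the specification does not define the format of the authorization information.

```python
from dataclasses import dataclass, field


@dataclass
class StoredModel:
    """Hypothetical record of a target TTS model kept in the cloud."""
    owner_id: str
    domain_id: str
    authorized_users: set[str] = field(default_factory=set)


def may_use_model(model: StoredModel, requester_id: str) -> bool:
    """The owner, and any user the owner has authorized, may invoke the model."""
    return requester_id == model.owner_id or requester_id in model.authorized_users
```

This mirrors the statement that the voice broadcast request comes either from the user who sent the TTS model generation request or from another user authorized by that user.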

In one embodiment, the personalized speech synthesis system generates a target TTS model corresponding to the children's story domain identifier for user A. When user A is at work and unable to be with his child, the child can access the cloud service of the personalized speech synthesis system through a smart device at home and request "Daddy, tell me a Peppa Pig story". The private cloud server corresponding to the personalized speech synthesis system recognizes that user A's child has been authorized by user A, and can address the child by nickname, for example "Doudou, Dad will tell you a story". The Peppa Pig story is then told in user A's voice generated by the target TTS model (the children's story itself comes from the public cloud server corresponding to the smart device).

In another embodiment, the personalized speech synthesis system generates a target TTS model corresponding to the weather forecast domain identifier for user B. When user B's parents, who live in the countryside, access the cloud service of the personalized speech synthesis system to query the weather through a smart device at home authorized by user B (for example, one logged in to user B's account), the weather can be broadcast in user B's voice generated by the target TTS model, reminding user B's parents to pay attention to weather changes and letting them feel the warmth of family affection.

In another embodiment, after the personalized speech synthesis system has generated the target TTS model for user C, if user C passes away, user C's relatives can still use a smart device authorized by user C (for example, one logged in to user C's account) to access the cloud service of the personalized speech synthesis system, which can then broadcast the weather, tell stories, broadcast news, tell jokes, etc. in user C's voice generated by the target TTS model, so that the relatives can still feel user C's companionship.

In the embodiments of the present specification, when the domain corresponding to a received voice broadcast request is inconsistent with the target domain identifier corresponding to the target TTS model, using the target TTS model to provide the personalized speech synthesis service would give a poor broadcast effect. In this case, the full-domain TTS model stored in the public cloud server can be invoked to provide the user with a better speech synthesis service.

The full-domain TTS model stored in the public cloud server may be constructed according to the prior art by collecting a large amount of voice data, or may be constructed by other methods, which is not specifically limited here.
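The fallback can be expressed as a simple selection rule, sketched below. Models are represented by plain strings purely for illustration, and the function name and the mapping from domain identifier to personal model are assumptions of this sketch.

```python
def choose_model(request_domain_id: str,
                 personal_models: dict[str, str],
                 public_model: str) -> str:
    """Use the user's model for the requested domain if one exists.

    Otherwise fall back to the general model held on the public cloud server.
    """
    return personal_models.get(request_domain_id, public_model)
```

For a user who has only a children's story model, a weather request would therefore be served by the public model rather than by the mismatched personal one.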

The technical solution described in the embodiments of this specification receives a TTS model generation request that is input by a user and includes a target domain identifier, sends the target recorded text corresponding to the target domain identifier to the user, and receives the audio file corresponding to the target recorded text returned by the user, the audio file being obtained by the user recording according to the target recorded text. Then, according to the audio file, a target TTS model corresponding to the target domain identifier is generated for the user. The target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics, which simplifies the generation process of the TTS model and reduces the cost of the personalized speech synthesis service.

FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. As shown in FIG. 2, at the hardware level, the electronic device includes a processor, and optionally also includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory can be connected to each other through the internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc. The bus can be divided into an address bus, a data bus, and a control bus. For ease of representation, only one bidirectional arrow is used in FIG. 2, but this does not mean that there is only one bus or one type of bus.

The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to form a device for personalized speech synthesis at a logical level. The processor executes the program stored in the memory, and specifically executes the steps of the method embodiment shown in FIG. 1.

The method shown in FIG. 1 may be applied to the processor, or implemented by the processor. The processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The aforementioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present specification. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in conjunction with the embodiments of the present specification may be directly executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

The electronic device can execute the method of the method embodiment shown in FIG. 1 and implement the functions of that embodiment, which are not described again here.

The embodiments of the present specification also propose a computer-readable storage medium that stores one or more programs, the one or more programs including instructions. When the instructions are executed by an electronic device that includes multiple application programs, the electronic device can execute the method for personalized speech synthesis in the embodiment shown in FIG. 1, and specifically performs the steps of that method embodiment.

FIG. 3 is a schematic structural diagram of an apparatus for personalized speech synthesis provided by an embodiment of the present specification. The apparatus 300 shown in FIG. 3 may be used to perform the method of the embodiment shown in FIG. 1 above. The apparatus 300 includes:

The receiving module 301 receives the TTS model generation request input by the user, and the TTS model generation request includes the target domain identifier;

The sending module 302 sends the target recorded text corresponding to the target domain identifier to the user;

The receiving module 301 receives an audio file corresponding to the target recorded text returned by the user, and the audio file is recorded by the user according to the target recorded text;

The TTS model generation module 303 generates a target TTS model corresponding to the target domain identifier for the user according to the audio file. The target TTS model is used to provide a personalized speech synthesis service with user pronunciation characteristics.

Optionally, the sending module 302 further includes:

The first determining unit determines the recorded text database, and the recorded text database includes the recorded text corresponding to the identifiers of different fields;

The second determining unit determines the target recorded text corresponding to the target domain identifier according to the recorded text database;

The sending unit sends the target recorded text to the user.

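Optionally, the recorded text database is obtained in the following manner:

Determine different domain identifiers, where each domain identifier corresponds to one domain;

According to a preset algorithm, generate the recorded text corresponding to each domain identifier, where the recorded text corresponding to a domain identifier includes characters and/or words commonly used in the domain corresponding to that domain identifier.

Optionally, the domain identifier includes at least one of the following:

A children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.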

Optionally, the TTS model generation module 303 further includes:

The pre-processing unit preprocesses the audio file to obtain the processed audio file;

The third determining unit determines the characteristic parameters matching the user's pronunciation characteristics according to the processed audio file;

The generating unit generates the target TTS model according to the characteristic parameters.

Optionally, the characteristic parameters include at least one of the following:

Tone, timbre, speed, pause, and accent.

Optionally, the pre-processing unit is specifically used for:

Perform noise reduction processing on audio files;

Determine, through automatic speech recognition technology, whether the audio file is correct.

Optionally, the device 300 further includes:

The receiving module 301 receives a voice broadcast request, and the voice broadcast request includes authorization information corresponding to the user;

The service module uses the target TTS model according to the voice broadcast request to provide personalized voice synthesis services.

Optionally, the personalized speech synthesis service includes at least one of the following:

Tell stories, broadcast weather forecasts, broadcast time, and broadcast news.

Optionally, the voice broadcast request comes from the user, or another user authorized by the user.

According to the device for personalized speech synthesis, the receiving module receives the TTS model generation request input by the user, the TTS model generation request including the target domain identifier; the sending module sends the target recorded text corresponding to the target domain identifier to the user; the receiving module receives the audio file corresponding to the target recorded text returned by the user, the audio file being obtained by the user recording according to the target recorded text; and the TTS model generation module generates a target TTS model corresponding to the target domain identifier for the user according to the audio file. The target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics, which simplifies the generation process of the TTS model and reduces the cost of the personalized speech synthesis service.
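How the three modules cooperate can be sketched as a single class. This is only an illustrative wiring: the class name `PersonalizedTTSDevice`, its method names, and the `build_model` callable standing in for the TTS model generation module are all hypothetical, and real model training is not shown.

```python
from typing import Callable


class PersonalizedTTSDevice:
    """Hypothetical wiring of the receiving, sending, and TTS model generation modules."""

    def __init__(self, text_db: dict[str, str],
                 build_model: Callable[[bytes, str], object]):
        self.text_db = text_db          # recorded text per domain identifier
        self.build_model = build_model  # stand-in for the TTS model generation module
        self.pending: dict[str, str] = {}                # user_id -> domain awaiting audio
        self.models: dict[tuple[str, str], object] = {}  # (user_id, domain_id) -> model

    def receive_model_request(self, user_id: str, domain_id: str) -> str:
        """Receiving module accepts the request; sending module returns the script."""
        text = self.text_db[domain_id]
        self.pending[user_id] = domain_id
        return text

    def receive_audio(self, user_id: str, audio: bytes) -> object:
        """Receiving module accepts the audio; the generation module builds the model."""
        domain_id = self.pending.pop(user_id)
        model = self.build_model(audio, domain_id)
        self.models[(user_id, domain_id)] = model
        return model
```

Storing models under a (user, domain) key reflects that each target TTS model belongs to one user and one target domain identifier.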

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to the circuit structure of diodes, transistors, switches, etc.) or an improvement in software (an improvement to a method flow). However, with the development of technology, improvements to many method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field-programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user programming the device. Designers program by themselves to "integrate" a digital system onto a single PLD, without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, this programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should also be aware that a hardware circuit implementing a logical method flow can be easily obtained simply by logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include but are not limited to the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logically program the method steps so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the means included in it for realizing various functions can also be regarded as structures within the hardware component. Or even, the means for realizing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.

The system, device, module, or unit explained in the above embodiments may be specifically implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above device is described with its functions divided into various units. Of course, when implementing this application, the functions of the units may be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, may be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operating steps are performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.

The present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types. The present application may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.

The embodiments in this specification are described in a progressive manner. The same or similar parts between the embodiments can be referred to each other. Each embodiment focuses on the differences from other embodiments. In particular, for the system embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for related parts, refer to the description of the method embodiment.

The above are only examples of the present application, and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the scope of the claims of this application.

Claims

A method for personalized speech synthesis, comprising:

Receiving a text-to-speech (TTS) model generation request input by a user, wherein the TTS model generation request includes a target domain identifier;

Sending a target recorded text corresponding to the target domain identifier to the user, and receiving an audio file corresponding to the target recorded text returned by the user, wherein the audio file is obtained by the user recording according to the target recorded text;

Generating, according to the audio file, a target TTS model corresponding to the target domain identifier for the user, wherein the target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics.
The method of claim 1, wherein sending the target recorded text corresponding to the target domain identifier to the user comprises:

Determining a recorded text database, wherein the recorded text database includes recorded texts corresponding to different domain identifiers;

Determining the target recorded text corresponding to the target domain identifier according to the recorded text database;

Sending the target recorded text to the user.
The method according to claim 2, wherein the recorded text database is obtained by:

Determining different domain identifiers, wherein each of the different domain identifiers corresponds to a domain;

Generating, according to a preset algorithm, the recorded text corresponding to each domain identifier, wherein the recorded text corresponding to a domain identifier includes common characters and/or words in the domain corresponding to that domain identifier.
The method of claim 3, wherein the domain identifier comprises at least one of the following:

A children's story domain identifier, a traffic domain identifier, a social news domain identifier, and a weather forecast domain identifier.
The method of claim 1, wherein generating the target TTS model corresponding to the target domain identifier for the user according to the audio file comprises:

Preprocessing the audio file to obtain a processed audio file;

Determining, according to the processed audio file, characteristic parameters matching the user's pronunciation characteristics;

Generating the target TTS model according to the characteristic parameters.
The method according to claim 5, wherein the characteristic parameter comprises at least one of the following:

Tone, timbre, speed, pause, and accent.
The method according to claim 5, wherein preprocessing the audio file includes at least one of the following steps:

Performing noise reduction processing on the audio file;

Determining whether the audio file is correct by means of automatic speech recognition technology.
The method of claim 1, further comprising:

Receiving a voice broadcast request, where the voice broadcast request includes authorization information corresponding to the user;

Providing, according to the voice broadcast request, the personalized speech synthesis service by using the target TTS model.
The method of claim 8, wherein the personalized speech synthesis service comprises at least one of the following:

Telling stories, broadcasting weather forecasts, broadcasting the time, and broadcasting news.
The method of claim 8, wherein the voice broadcast request comes from the user or from another user authorized by the user.
An apparatus for personalized speech synthesis, configured to perform the method for personalized speech synthesis according to any one of claims 1-10, the apparatus comprising:

A receiving module, configured to receive a TTS model generation request input by a user, wherein the TTS model generation request includes a target domain identifier;

A sending module, configured to send the target recorded text corresponding to the target domain identifier to the user;

The receiving module being further configured to receive an audio file corresponding to the target recorded text returned by the user, wherein the audio file is obtained by the user recording according to the target recorded text;

A TTS model generation module, configured to generate, according to the audio file, a target TTS model corresponding to the target domain identifier for the user, wherein the target TTS model is used to provide a personalized speech synthesis service with the user's pronunciation characteristics.
An electronic device, comprising:

A memory, configured to store a program;

A processor, configured to execute the program stored in the memory, and specifically to execute the method for personalized speech synthesis according to any one of claims 1-10.
A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to execute the method for personalized speech synthesis according to any one of claims 1-10.