CN116343743A - Speech synthesis method and system based on XTTS

Info

Publication number: CN116343743A
Application number: CN202310207016.9A
Authority: CN
Inventors: 姚森敬; 卢志良; 敖榜; 习伟; 于力; 郭尧; 杨伟; 任正国; 王鹏凯; 廖灿; 郑桦; 黄文琦; 梁凌宇; 辛文成
Original assignee: China Southern Power Grid Artificial Intelligence Technology Co ltd; Southern Power Grid Digital Grid Research Institute Co Ltd
Current assignee: China Southern Power Grid Artificial Intelligence Technology Co ltd; Southern Power Grid Digital Grid Research Institute Co Ltd
Priority date: 2023-03-07
Filing date: 2023-03-07
Publication date: 2023-06-27

Abstract

The invention discloses an XTTS-based speech synthesis method and system. The method comprises the following steps: after a speaker is selected and the text to be synthesized is entered, a play button is clicked to audition the speaker's synthesis effect; based on the audition result, speech synthesis is performed, during which synthesis attribute parameters can be adjusted and tag text can be added, and after an attribute or tag is added the play button can be clicked again to audition and optimize the text; the synthesis result is then verified, and once verification passes the speech is generated and the result is saved so that batch synthesis can be performed. The method enables quick, lightweight effect optimization, so that speakers and optimized resources can be better provided for applications such as navigation and outbound calling. The synthesis effect can be experienced quickly without writing code, manually modifying interface input parameters, or restarting services, so that the required audio can be generated quickly with the synthesis engine and synthesis problems such as inaccurate pronunciation and abnormal pauses can be resolved.

Description

Speech synthesis method and system based on XTTS

Technical Field

The invention relates to the technical field of speech synthesis, and in particular to an XTTS-based speech synthesis method.

Background

As unified communication applications become widespread across the whole network, the user base keeps growing and the call-traffic pressure on the communication service hotline serving all network users rises rapidly. At the same time, as communication services continue to develop, their scope becomes ever wider. The current unified communication service platform is constrained by the manpower, working hours, and knowledge level of the existing manual service, and therefore struggles to meet the growing demand for telephone consultation. Human agents and the traditional self-service voice response system rely on keypad interaction, which greatly limits the efficiency of interaction between the user and the system, makes customers wait too long, and seriously degrades the customer service experience. When users cannot quickly obtain the service they need, they switch to manual service, which greatly increases the manual call-traffic pressure and raises operating costs.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract, and in the title of the application to avoid obscuring their purpose; such simplifications or omissions should not be used to limit the scope of the invention.

The present invention has been made in view of the above-described problems.

Therefore, the technical problems addressed by the invention are as follows: existing speech synthesis methods have poor synthesis accuracy, offer little control over speech rate and intonation, and suffer from inaccurate pronunciation and abnormal pauses.

In order to solve the above technical problems, the invention provides the following technical solution: an XTTS-based speech synthesis method, comprising:

after a speaker is selected and the text to be synthesized is entered, clicking a play button to audition the speaker's synthesis effect;

performing speech synthesis according to the audition result, wherein synthesis attribute parameters can be adjusted and tag text can be added, and after an attribute or tag is added the play button can be clicked again to audition and optimize the text;

and verifying the synthesis result, generating the speech after verification passes, and saving the generated result so that batch synthesis can be performed.

As a preferred embodiment of the XTTS-based speech synthesis method of the present invention, the adjustable synthesis attribute parameters include: volume, speech rate, intonation, sound effect, and background sound.

As a preferred embodiment of the XTTS-based speech synthesis method of the present invention, the tag text that can be added includes:

forced pinyin, silence, pause, volume, speech rate, intonation, sound effect, Chinese punctuation, English word, surname, number, English digit 0, and Chinese digit 1.

As a preferred embodiment of the XTTS-based speech synthesis method of the present invention, the text optimization includes:

attribute configuration, tag configuration, text configuration, and text management.

As a preferred embodiment of the XTTS-based speech synthesis method of the present invention, verifying the synthesis result includes:

checking whether the resources support the production of template voice resources;

the resource check includes: verifying that the recording files are sampled at 16k16bit and are in WAV format; verifying that the recording numbers start from 1 and are consistent; verifying that the number of text entries matches the number of recordings; verifying that the text contains no special symbols, digits, or spaces; verifying that no text content is repeated; and verifying that the text contains no empty lines;

when the resource check fails, displaying the reason for the error, so that the resources can be corrected according to the prompt and verified again;

and generating the speech after the resource check passes, and saving the verification result.
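The resource check listed above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names `check_wav` and `verify_resources` are hypothetical, "16k16bit" is taken to mean a 16 kHz sample rate with 16-bit samples, "consistent" numbering is assumed to mean consecutive numbers starting at 1, and "special symbols" is assumed to mean anything other than CJK ideographs and ASCII letters.

```python
import io
import re
import wave
from typing import Dict, List

# Assumption: only CJK ideographs and ASCII letters are allowed in the text;
# digits, spaces, and all other symbols are rejected.
_ALLOWED_TEXT = re.compile(r"[\u4e00-\u9fffA-Za-z]+")


def check_wav(data: bytes) -> List[str]:
    """Return the problems found in one recording (empty list = passed)."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wav:
            problems = []
            if wav.getframerate() != 16000:
                problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
            if wav.getsampwidth() != 2:
                problems.append(f"sample width is {8 * wav.getsampwidth()} bit, expected 16")
            return problems
    except (wave.Error, EOFError) as exc:
        return [f"not a valid WAV file: {exc}"]


def verify_resources(recordings: Dict[int, bytes], texts: List[str]) -> List[str]:
    """Run every check and return human-readable error reasons."""
    errors = []
    numbers = sorted(recordings)
    if numbers != list(range(1, len(numbers) + 1)):
        errors.append("recording numbers must start at 1 and be consecutive")
    if len(texts) != len(recordings):
        errors.append("number of text lines does not match number of recordings")
    seen = set()
    for line_no, text in enumerate(texts, start=1):
        if text == "":
            errors.append(f"text line {line_no} is empty")
            continue
        if not _ALLOWED_TEXT.fullmatch(text):
            errors.append(f"text line {line_no} contains special symbols, digits, or spaces")
        if text in seen:
            errors.append(f"text line {line_no} duplicates an earlier line")
        seen.add(text)
    for number in numbers:
        for problem in check_wav(recordings[number]):
            errors.append(f"recording {number}: {problem}")
    return errors
```

An empty return value corresponds to a passed check; otherwise each string plays the role of the prompted error reason, after which the resources are corrected and verified again.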

As a preferred embodiment of the XTTS-based speech synthesis method of the present invention, the batch synthesis includes:

texts requiring batch synthesis are uploaded in batches by creating a task, general attributes are configured for the task, and all texts under the task are synthesized according to those general attributes;

when audition reveals problems such as wrong pronunciation, abnormal pauses, or a poor effect in some of the texts, attributes can be configured individually for the specified text by modifying the configuration of its text tags, and the system applies the attributes customized for the specified text first and falls back to the general attributes otherwise;

when no errors are found after the text is auditioned, the audition result is saved;

the batch-synthesized texts and audio support batch export, so that customers can conveniently use the synthesized audio in other application systems.
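The precedence rule described above (text-specific settings first, task-level general attributes otherwise) can be illustrated with a small Python sketch. The class names `TextItem` and `BatchTask` and the attribute keys are hypothetical and chosen only for illustration; the source does not describe the data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TextItem:
    """One text under a task, with optional text-specific settings."""
    text: str
    overrides: Dict[str, float] = field(default_factory=dict)


@dataclass
class BatchTask:
    """A batch task: general attributes plus the texts uploaded under it."""
    name: str
    general: Dict[str, float]
    items: List[TextItem] = field(default_factory=list)

    def effective_attributes(self, item: TextItem) -> Dict[str, float]:
        # Start from the task-level general attributes, then let the
        # text-specific settings take priority where they exist.
        merged = dict(self.general)
        merged.update(item.overrides)
        return merged

    def plan(self) -> List[Tuple[str, Dict[str, float]]]:
        """Pair every text with the attributes it would be synthesized with."""
        return [(item.text, self.effective_attributes(item)) for item in self.items]
```

For example, a task with general attributes `{"volume": 1.0, "speech_rate": 1.0}` and one text overriding `speech_rate` to `0.8` yields that text with the slower rate, while every other text keeps the general values.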

In order to solve the technical problems, the invention provides the following technical scheme: an XTTS-based speech synthesis system, comprising:

a synthesis configuration module, a synthesis experience module, a resource management module, a general management module, and an interface service module;

the synthesis configuration module is used for trying out all intelligent customer-service voice libraries on the external network, and for trying out the voice libraries used in a project or exclusively customized voice libraries on the internal network;

the synthesis experience module is used for providing batch synthesis and quick, lightweight effect optimization, making synthesis convenient and efficient to use;

the resource management module is used for configuring and activating personalized resources such as dictionaries and rules, and for producing, verifying, and activating template voice resources, thereby enabling hot loading of resources;

the general management module is used for managing voice libraries, users, and logs, meeting the basic requirements of product use;

the interface service module is used for providing four types of interface services, namely batch synthesis, forwarding service, resource packaging, and structured rules; it supports integration over the HTTP protocol and access in multiple modes, and MRCP access requires an IMS component to be provided.

As a preferred embodiment of the XTTS-based speech synthesis system of the present invention, the resource management module includes:

an attribute configuration module, a tag configuration module, a text configuration module, a batch synthesis module, and a text management module;

the attribute configuration module is used for configuring synthesis attributes for a task;

the tag configuration module is used for configuring tags for the text;

the text configuration module is used for adding tags to the text to be optimized; so that the application calling the synthesis service does not have to modify the original text it passes as the input parameter, the platform adds a text-configuration publishing function;

after tag optimization of the text is complete, the text or task to be published is selected and the online publish button is clicked to complete real-time online publishing, so that pronunciation errors or pause problems in the text can be resolved online without modifying the original text;

when a publication needs to be cancelled, the text is selected and the cancel-publish button is clicked, or the delete button is clicked for the task or text, which cancels the published task or content; that content is then synthesized online according to the engine's own judgment or other rules;

the batch synthesis module is used for uploading the texts requiring batch synthesis in batches by creating a task, and for configuring general attributes for the task, all texts under the task being synthesized according to the general attributes;

the text management module is used for providing text management functions for the text configuration module and the batch synthesis module, and supports viewing imported or optimized texts, fuzzy search by text content, batch addition, deletion, and export of tasks or texts, and modification of text content.
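The publish and cancel-publish behaviour described above can be pictured as a lookup applied to incoming text before synthesis. The Python sketch below is an assumption-laden illustration: the class `PublishedTextStore`, its method names, and the bracketed tag markup in the usage note are all hypothetical, since the source does not describe the storage or the tag syntax.

```python
from typing import Dict


class PublishedTextStore:
    """Maps an original text to its published, tag-optimized version."""

    def __init__(self) -> None:
        self._published: Dict[str, str] = {}

    def publish(self, original: str, optimized: str) -> None:
        # Takes effect immediately; the calling application keeps sending
        # the unmodified original text.
        self._published[original] = optimized

    def unpublish(self, original: str) -> None:
        # After cancellation the engine sees the original text again and
        # synthesizes it by its own judgment.
        self._published.pop(original, None)

    def resolve(self, incoming: str) -> str:
        """Return the text that would actually be handed to the engine."""
        return self._published.get(incoming, incoming)
```

With this shape, `store.publish("重庆欢迎你", "重[pinyin=chong2]庆欢迎你")` makes `store.resolve("重庆欢迎你")` return the tagged text, and `store.unpublish("重庆欢迎你")` restores the pass-through behaviour.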

As a preferred embodiment of the XTTS-based speech synthesis system of the present invention, the attribute configuration module further includes:

speaker, volume, speech rate, intonation, sound effect, sampling rate, background sound, word, number, and punctuation, each with a default configuration that can be selected and modified;

the speaker setting displays the names of the voice libraries for which speaker resources are available and authorized;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

the sound effects include negligence, echo, robot, chorus, underwater, and reverberation, defaulting to none;

the sampling rate options are 8k16bit, 16k16bit, 8kalaw, and 8kulaw, defaulting to 16k16bit;

8kalaw and 8kulaw are the sampling formats used for telephone channels;

the background sound includes agent-seat noise; if further background sound resources are needed, they must be added through the contact person, and no background sound is used by default;

the word setting applies when the pronunciation of an English word is uncertain, in which case the pronunciation is judged automatically from its letters;

the number setting applies when it is uncertain whether a number should be read digit by digit, in which case the reading is judged automatically;

the punctuation setting includes not read and read aloud, defaulting to not read.
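The numeric ranges and defaults listed above lend themselves to a small validated configuration object. The following Python sketch is illustrative only: the class name `SynthesisAttributes` and its field names are hypothetical, and only the ranges, the four sampling options, and the defaults come from the text.

```python
from dataclasses import dataclass

# The four sampling options named in the text.
SAMPLE_FORMATS = ("8k16bit", "16k16bit", "8kalaw", "8kulaw")


def _check_range(name: str, value: float, low: float, high: float) -> None:
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class SynthesisAttributes:
    """Task-level synthesis attributes with the defaults stated in the text."""
    volume: float = 1.0               # allowed range 0-2
    speech_rate: float = 1.0          # allowed range 0.5-2
    intonation: float = 1.0           # allowed range 0.5-2
    sample_format: str = "16k16bit"   # 8kalaw / 8kulaw target telephone channels
    read_punctuation: bool = False    # punctuation is not read by default

    def __post_init__(self) -> None:
        _check_range("volume", self.volume, 0.0, 2.0)
        _check_range("speech_rate", self.speech_rate, 0.5, 2.0)
        _check_range("intonation", self.intonation, 0.5, 2.0)
        if self.sample_format not in SAMPLE_FORMATS:
            raise ValueError(f"unknown sample format: {self.sample_format}")
```

Calling `SynthesisAttributes()` gives the default configuration, while an out-of-range value such as `SynthesisAttributes(speech_rate=0.4)` raises `ValueError` before any synthesis request is built.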

As a preferred embodiment of the XTTS-based speech synthesis system of the present invention, the tag configuration module further includes:

forced pinyin, silence, pause, volume, speech rate, intonation, sound effect, Chinese punctuation, English word, surname, number, English digit 0, and Chinese digit 1, each with a default configuration that can be selected and modified;

the forced pinyin tag: when the cursor is placed within or at the end of a sentence, the pronunciation of the word immediately before the cursor can be configured;

the silence tag: when the cursor is placed at the beginning, in the middle, or at the end of a sentence, the silence duration at the cursor position can be configured by entering a number, measured in ms;

the pause tag: when the cursor is placed at the beginning, in the middle, or at the end of a sentence, the pause at the cursor position can be configured as no pause, short pause, or long pause, defaulting to no pause;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

the sound effects include negligence, echo, robot, chorus, underwater, reverberation, and sarcastic tone (yin-yang guaiqi), defaulting to no sound effect;

the Chinese punctuation tag includes not read and read aloud, defaulting to not read;

the English word tag includes automatic judgment, reading by letter, and reading as a word, defaulting to automatic judgment;

the surname tag includes automatic reading, forced pinyin reading, and a default option;

the number tag includes automatic judgment, reading as a digit string, and reading as a numerical value, defaulting to automatic judgment;

the English digit 0 tag includes reading 0 as "o" and reading 0 as "zero", defaulting to "o";

the Chinese digit 1 tag includes reading 1 as "yao" and reading 1 as "yi", defaulting to "yi".
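The cursor-based tags above (forced pinyin, silence, pause) can be modelled as annotations attached to character offsets. The Python sketch below is a hypothetical illustration: the `CursorTag` class, the `render` function, and the `[kind=value]` markup are invented for this example and are not the tag syntax of the patented system; the forced pinyin tag is assumed to apply to the single character before the cursor.

```python
from dataclasses import dataclass
from typing import List

PAUSE_LEVELS = ("none", "short", "long")  # default in the text: no pause


@dataclass(frozen=True)
class CursorTag:
    position: int  # cursor offset in the text; 0 is the start of the sentence
    kind: str      # "pinyin", "silence", or "pause"
    value: str     # e.g. "chong2", "300" (ms), or "short"


def _validate(text: str, tag: CursorTag) -> None:
    if not 0 <= tag.position <= len(text):
        raise ValueError("cursor position is outside the text")
    if tag.kind == "pinyin" and tag.position == 0:
        # Forced pinyin applies to what precedes the cursor, so it cannot
        # be placed at the start of the sentence.
        raise ValueError("forced pinyin needs a character before the cursor")
    if tag.kind == "silence" and not tag.value.isdigit():
        raise ValueError("silence duration must be a number of milliseconds")
    if tag.kind == "pause" and tag.value not in PAUSE_LEVELS:
        raise ValueError("pause must be none, short, or long")
    if tag.kind not in ("pinyin", "silence", "pause"):
        raise ValueError(f"unsupported tag kind: {tag.kind}")


def render(text: str, tags: List[CursorTag]) -> str:
    """Insert a hypothetical [kind=value] marker at each cursor position."""
    for tag in tags:
        _validate(text, tag)
    out = text
    # Apply from right to left so that earlier offsets stay valid.
    for tag in sorted(tags, key=lambda t: t.position, reverse=True):
        out = out[:tag.position] + f"[{tag.kind}={tag.value}]" + out[tag.position:]
    return out
```

For instance, tagging "重庆欢迎你" with a forced pinyin of `chong2` after the first character and 300 ms of silence after the second produces `重[pinyin=chong2]庆[silence=300]欢迎你` in this made-up markup.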

The beneficial effects of the invention are as follows: the XTTS-based speech synthesis method provided by the invention turns the speech synthesis capability into a hands-on application platform with a user interface, rounding out the XTTS speech synthesis product. Personalized resources can be customized, synthesis can be tried out with zero development, and quick, lightweight effect optimization is achieved, so that speakers and optimized resources can be better provided for applications such as navigation and outbound calling. The synthesis effect can be experienced quickly without writing code, manually modifying interface input parameters, or restarting services, so that the required audio can be generated quickly with the synthesis engine and synthesis problems such as inaccurate pronunciation and abnormal pauses are resolved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is an overall flow chart of a method for XTTS-based speech synthesis in accordance with one embodiment of the invention;

FIG. 2 is an overall structure diagram of an XTTS-based speech synthesis system according to a second embodiment of the present invention.

Detailed Description

So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.

Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.

Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Example 1

Referring to FIG. 1, one embodiment of the present invention provides an XTTS-based speech synthesis method, including:

the adjustable synthesis attribute parameters include: volume, speech rate, intonation, sound effect, and background sound;

the tag text that can be added includes: forced pinyin, silence, pause, volume, speech rate, intonation, sound effect, Chinese punctuation, English word, surname, number, English digit 0, and Chinese digit 1;

the text optimization includes: attribute configuration, tag configuration, text management, and batch synthesis;

the synthesis result is verified, the speech is generated after verification passes, and the generated result is saved so that batch synthesis can be performed;

verifying the synthesis result includes: checking whether the resources support the production of template voice resources;

the resource check includes: verifying that the recording files are sampled at 16k16bit and are in WAV format; verifying that the recording numbers start from 1 and are consistent; verifying that the number of text entries matches the number of recordings; verifying that the text contains no special symbols, digits, or spaces; verifying that no text content is repeated; and verifying that the text contains no empty lines;

the batch synthesis includes: texts requiring batch synthesis are uploaded in batches by creating a task, general attributes are configured for the task, and all texts under the task are synthesized according to those general attributes;

when no errors are found after the text is auditioned, the audition result is saved.

The system provides four functions: synthesis experience, synthesis configuration, resource management, and general management;

the synthesis experience function allows all intelligent customer-service voice libraries to be tried out on the external network, and the voice libraries used in a project or exclusively customized voice libraries to be tried out on the internal network;

the synthesis configuration function provides batch synthesis and quick, lightweight effect optimization, making synthesis convenient and efficient to use;

the resource management function covers configuring and activating personalized resources such as dictionaries and rules, and producing, verifying, and activating template voice resources, thereby enabling hot loading of resources. The general management function covers the management of voice libraries, users, and logs, meeting the basic requirements of product use.

Example 2

Referring to FIG. 2, one embodiment of the present invention provides an XTTS-based speech synthesis system comprising:

a synthesis configuration module 100, a synthesis experience module 200, a resource management module 300, a general management module 400, and an interface service module 500;

the synthesis configuration module 100 is used for trying out all intelligent customer-service voice libraries on the external network, and for trying out the voice libraries used in a project or exclusively customized voice libraries on the internal network;

the synthesis experience module 200 is used for providing batch synthesis and quick, lightweight effect optimization, making synthesis convenient and efficient to use;

the resource management module 300 is used for configuring and activating personalized resources such as dictionaries and rules, and for producing, verifying, and activating template voice resources, thereby enabling hot loading of resources;

the general management module 400 is used for managing voice libraries, users, and logs, meeting the basic requirements of product use;

the interface service module 500 is used for providing four types of interface services, namely batch synthesis, forwarding service, resource packaging, and structured rules; it supports integration over the HTTP protocol and access in multiple modes, and MRCP access requires an IMS component to be provided.

The resource management module 300 includes: an attribute configuration module 301, a tag configuration module 302, a text configuration module 303, a batch synthesis module 304, and a text management module 305;

the attribute configuration module 301 is used for configuring synthesis attributes for a task;

the tag configuration module 302 is used for configuring tags for the text;

the text configuration module 303 is used for adding tags to the text to be optimized; so that the application calling the synthesis service does not have to modify the original text it passes as the input parameter, the platform adds a text-configuration publishing function;

the batch synthesis module 304 first uploads the texts requiring batch synthesis in batches by creating a task, and then configures general attributes for the task, all texts under the task being synthesized according to the general attributes.

The text management module 305 is used for providing text management functions for the text configuration module 303 and the batch synthesis module 304, and supports viewing imported or optimized texts, fuzzy search by text content, batch addition, deletion, and export of tasks or texts, and modification of text content.

The attribute configuration module 301 further includes: speaker, volume, speech rate, intonation, sound effect, sampling rate, background sound, word, number, and punctuation, each with a default configuration that can be selected and modified;

the speaker setting displays the names of the voice libraries for which speaker resources are available and authorized;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

the sampling rate options are 8k16bit, 16k16bit, 8kalaw, and 8kulaw, defaulting to 16k16bit;

8kalaw and 8kulaw are the sampling formats used for telephone channels;

the background sound includes agent-seat noise; if further background sound resources are needed, they must be added through the contact person, and no background sound is used by default;

the punctuation setting includes not read and read aloud, defaulting to not read.

The tag configuration module 302 further includes: forced pinyin, silence, pause, volume, speech rate, intonation, sound effect, Chinese punctuation, English word, surname, number, English digit 0, and Chinese digit 1, each with a default configuration that can be selected and modified;

the forced pinyin tag: when the cursor is placed within or at the end of a sentence, the pronunciation of the word immediately before the cursor can be configured;

the silence tag: when the cursor is placed at the beginning, in the middle, or at the end of a sentence, the silence duration at the cursor position can be configured by entering a number, measured in ms;

the pause tag: when the cursor is placed at the beginning, in the middle, or at the end of a sentence, the pause at the cursor position can be configured as no pause, short pause, or long pause, defaulting to no pause;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

the sound effects include neglect, echo, robot, chorus, underwater, reverberation, and sarcastic tone (yin-yang guaiqi), defaulting to no sound effect;

the English word tag includes automatic judgment, reading by letter, and reading as a word, defaulting to automatic judgment;

the surname tag includes automatic reading, forced pinyin reading, and a default option;

the number tag includes automatic judgment, reading as a digit string, and reading as a numerical value, defaulting to automatic judgment;

the English digit 0 tag includes reading 0 as "o" and reading 0 as "zero", defaulting to "o";

the Chinese digit 1 tag includes reading 1 as "yao" and reading 1 as "yi", defaulting to "yi".

Example 3

One embodiment of the invention provides an XTTS-based speech synthesis method and system; to verify the beneficial effects of the invention, a scientific demonstration is carried out through simulation experiments.

In this embodiment, the method of the invention is tested experimentally in a preset equivalent experimental environment, and the specific experimental results are shown in Tables 2-3.

Table 1. Operating conditions:

Table 2. Working simulation:

Table 3. Working performance simulation:

As can be seen from Tables 2-3, with four speakers at 48 concurrent channels the minimum real-time rate is 1.542; when the number of channels is increased further, the minimum real-time rate falls below 1.5, so four speakers can be supported at 48 concurrent channels.

The four-frame resource performance reaches 149 concurrent channels for a single speaker, 124 concurrent channels for two speakers, 86 concurrent channels for five speakers, and 110 concurrent channels for a single English speaker.

When customers use the method of the invention, they do not need to write code, manually modify interface input parameters, or restart services; they can experience the synthesis effect quickly, use the synthesis engine to generate the required audio quickly, and resolve synthesis problems such as inaccurate pronunciation and abnormal pauses.

It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims

1. A method of XTTS-based speech synthesis, comprising:

2. The XTTS-based speech synthesis method of claim 1, wherein the adjusting synthesis attribute parameters comprises: volume, speech rate, intonation, sound effects and background sounds.

3. The XTTS-based speech synthesis method of claim 1 or 2, wherein the adding the tag text comprises:

4. The XTTS-based speech synthesis method of claim 1, wherein the text optimization comprises:

5. The XTTS-based speech synthesis method of claim 1, wherein the validating the synthesis result comprises:

6. The XTTS-based speech synthesis method of claim 1, wherein the batch synthesis comprises:

when no error exists after text hearing test, saving the hearing test result;

7. An XTTS-based speech synthesis system, comprising:

a synthesis configuration module (100), a synthesis experience module (200), a resource management module (300), a general management module (400), and an interface service module (500);

the synthesis configuration module (100) is used for trying out all intelligent customer-service voice libraries on the external network, and for trying out the voice libraries used in a project or exclusively customized voice libraries on the internal network;

the synthesis experience module (200) is used for providing batch synthesis and quick, lightweight effect optimization, making synthesis convenient and efficient to use;

the resource management module (300) is used for configuring and activating personalized resources such as dictionaries and rules, and for producing, verifying, and activating template voice resources, thereby enabling hot loading of resources;

the general management module (400) is used for managing voice libraries, users, and logs, meeting the basic requirements of product use;

the interface service module (500) is used for providing four types of interface services, namely batch synthesis, forwarding service, resource packaging, and structured rules; it supports integration over the HTTP protocol and access in multiple modes, and MRCP access requires an IMS component to be provided.

8. The XTTS-based speech synthesis system of claim 7, wherein the resource management module (300) comprises:

an attribute configuration module (301), a tag configuration module (302), a text configuration module (303), a batch synthesis module (304), and a text management module (305);

the attribute configuration module (301) is used for configuring synthesis attributes for a task;

the tag configuration module (302) is used for configuring tags for the text;

the text configuration module (303) is used for adding tags to the text to be optimized; so that the application calling the synthesis service does not have to modify the original text it passes as the input parameter, the platform adds a text-configuration publishing function;

the batch synthesis module (304) is used for uploading the texts requiring batch synthesis in batches by creating a task, and for configuring general attributes for the task, all texts under the task being synthesized according to the general attributes;

the text management module (305) is used for providing text management functions for the text configuration module (303) and the batch synthesis module (304), and supports viewing imported or optimized texts, fuzzy search by text content, batch addition, deletion, and export of tasks or texts, and modification of text content.

9. The XTTS-based speech synthesis system of claim 8, wherein the attribute configuration module (301) further comprises: speaker, volume, speech rate, intonation, sound effect, sampling rate, background sound, word, number, and punctuation, each with a default configuration that can be selected and modified;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

8kalaw and 8kulaw are the sampling formats used for telephone channels;

the punctuation setting includes not read and read aloud, defaulting to not read.

10. The XTTS-based speech synthesis system of claim 8, wherein the tag configuration module (302) further comprises: forced pinyin, silence, pause, volume, speech rate, intonation, sound effect, Chinese punctuation, English word, surname, number, English digit 0, and Chinese digit 1, each with a default configuration that can be selected and modified;

the volume ranges from 0 to 2 and defaults to 1;

the speech rate ranges from 0.5 to 2 and defaults to 1;

the intonation ranges from 0.5 to 2 and defaults to 1;

the surname tag includes automatic reading, forced pinyin reading, and a default option;

the Chinese digit 1 tag includes reading 1 as "yao" and reading 1 as "yi", defaulting to "yi".