CN116798400A - Speech synthesis method and system based on computer program

Speech synthesis method and system based on computer program

Info

Publication number: CN116798400A
Application number: CN202210237919.7A
Authority: CN (China)
Prior art keywords: model, speaker, speech synthesis, interest, speech
Priority date: 2022-03-11 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-03-11
Publication date: 2023-09-22
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 雷文辉
Current Assignee: Porsche Shanghai Digital Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Porsche Shanghai Digital Technology Co ltd
Application filed by Porsche Shanghai Digital Technology Co ltd


Abstract

A computer program-based speech synthesis method, comprising: acquiring a front-end model and a back-end model for speech synthesis; acquiring a generic voiceprint model to generate a reference speech synthesis engine; model-adaptively adjusting the generic voiceprint model based on collected acoustic feature data of at least one speaker of interest to generate a corresponding customized voiceprint model for the at least one speaker of interest; generating a corresponding customized speech synthesis engine for the at least one speaker of interest; and processing text to be read using the customized speech synthesis engine of a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker. With the method, the corresponding system, and the vehicle, user- and content-customized speech synthesis engines can be generated, providing a rich personalized experience for the user.

Description

Speech synthesis method and system based on computer program
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a computer program-based speech synthesis method, system, corresponding vehicle, computer device, and computer-readable storage medium.
Background
Speech synthesis (Text-To-Speech, TTS) is one of the most important forms of feedback when a machine interacts with a human being, as it enables input text to be automatically turned into speech by the machine.
In recent years, voice assistants have become popular in the fields of smart vehicles and smartphones. They provide speech recognition, natural language understanding, and speech synthesis, and on the basis of these technologies enrich the user experience, for example: broadcasting navigation prompts, querying the weather, searching for and playing music, reading the news, controlling the vehicle, and the like.
In the vehicle industry, suppliers of speech synthesis models provide their models to original equipment manufacturers or tier-one suppliers, so that the models can be integrated into the vehicle's electronic and electrical systems (E/E systems) and used to read out input text as speech. For example, when the driver asks the voice assistant about the weather, the assistant queries the information from the internet and responds accordingly; if the retrieved information is "tomorrow will be sunny with a temperature of 26 degrees", this information is displayed on the vehicle's screen and read out by speech synthesis. Moreover, the voice style of speech synthesis in a vehicle is fixed to the voice of the speaker chosen in the speech synthesis model provided by the vehicle manufacturer or supplier.
However, existing speech synthesis techniques have a number of drawbacks when facing the needs of a new generation of users. First, the voice style of the speaker used for speech synthesis is the same and fixed across a given vehicle model and even a given vehicle brand, with hardly any variation. Furthermore, since the data training process for speech synthesis is time-consuming and expensive, it is almost impossible to generate a unique speech synthesis engine for each user. Finally, the same voice style is used to read all kinds of content, and the user cannot select a preferred voice style for particular content.
Thus, there is a need for a new, personalized, improved solution for content-oriented and user-customized speech synthesis.
Disclosure of Invention
To improve at least one of the above problems, the present invention provides a computer program-based speech synthesis method, system, corresponding vehicle, computer device and computer-readable storage medium.
According to a first aspect of the present invention, there is provided a computer program-based speech synthesis method, the method comprising:
acquiring a front-end model and a back-end model for speech synthesis, wherein the front-end model at least represents a model for analyzing and processing text, and the back-end model at least represents a model for representing acoustic characteristics of speech of one speaker;
acquiring a generic voiceprint model and combining the generic voiceprint model with the front-end model and the back-end model to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning;
collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, performing model adaptation on the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest;
combining the respective customized voiceprint models of the at least one speaker of interest with the front-end model and the back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine;
processing text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
According to a second aspect of the present invention, there is provided a computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
According to a third aspect of the present invention there is provided a vehicle comprising a computer program based speech synthesis system as described in any of the embodiments of the second aspect above.
According to a fourth aspect of the present invention there is provided a computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor, which when executed by the processor, instruct the processor to perform a computer program-based speech synthesis method according to any of the embodiments of the first aspect described above.
According to a fifth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the computer program-based speech synthesis method according to any of the embodiments of the first aspect described above to be performed.
The computer program-based speech synthesis method, system, vehicle, computer device, and computer-readable storage medium allow a user to obtain the various voice styles they desire on top of the original speech synthesis system and to fully express their individuality. Furthermore, since the present invention provides a very smooth and natural way to obtain a new speech synthesis system, neither the user nor the speech synthesis engine provider needs to expend much cost or effort on it. The user can also select different speech synthesis styles to read different content. In this way, the invention provides a rich personalized experience for the user at low cost and with simple operation.
Drawings
Non-limiting and non-exhaustive embodiments of the present invention are described by way of example with reference to the following drawings, wherein:
Fig. 1 is a schematic flow chart of a computer program-based speech synthesis method according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a computer program-based speech synthesis method according to an embodiment of the invention.
Fig. 3 is a schematic representation of the generation of a voiceprint model of a speaker of interest in accordance with one embodiment of the present invention.
Fig. 4 is a simplified voiceprint model space diagram in accordance with one embodiment of the present invention.
Fig. 5 is a schematic diagram of a computer program-based speech synthesis method according to one embodiment of the invention.
Fig. 6 is a schematic flow chart of a computer program-based speech synthesis method according to another embodiment of the invention.
Fig. 7 is a schematic diagram of a computer program-based speech synthesis system according to one embodiment of the invention.
Detailed Description
To further clarify the above and other features and advantages of the present invention, a further description of the invention will be rendered by reference to the appended drawings. It should be understood that the specific embodiments presented herein are for purposes of explanation to those skilled in the art and are intended to be illustrative only and not limiting.
Fig. 1 schematically shows a computer program based speech synthesis method S100 according to an embodiment of the invention. The method S100 may include step S110, step S120, step S130, step S140, and step S150.
In step S110, a front-end model M_FE and a back-end model M_BE for speech synthesis are acquired, wherein the front-end model M_FE represents at least a model for analyzing and processing text, and the back-end model M_BE represents at least a model characterizing the acoustic features of one speaker's speech.
In step S120, a generic voiceprint model M_VG is acquired and combined with the front-end model M_FE and the back-end model M_BE to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning.
In step S130, corresponding speech samples of at least one speaker of interest are collected and acoustic feature data are extracted from the speech samples, and the generic voiceprint model M_VG is model-adaptively adjusted based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest.
In step S140, the respective customized voiceprint model of the at least one speaker of interest is combined with the front-end model M_FE and the back-end model M_BE to generate a respective customized speech synthesis engine for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine.
In step S150, the text to be read is processed using the customized speech synthesis engine corresponding to the speaker of interest selected by the user, to generate corresponding speech having the acoustic characteristics of the selected speaker.
Fig. 2 is a schematic diagram of a computer program-based speech synthesis method according to an embodiment of the invention.
In one embodiment, the front-end model M_FE comprises at least a word segmentation model and a prosody model, and the front-end model M_FE is obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
To obtain a better speech synthesis model, it is usually necessary to select a professional voice talent and prepare a large, high-coverage corpus; the professional speaker reads the sentences of the corpus to produce recordings, the sentences and audio are then segmented and labeled and combined as training data, and the front-end model M_FE (e.g., a hidden Markov model or a deep neural network) is obtained through a machine learning (e.g., deep learning) algorithm applicable in the computer program-based speech synthesis method herein.
Specifically, in training the front-end model M_FE, text normalization (also called preprocessing or tokenization) is first performed to convert raw text containing digits, abbreviations, and the like into the corresponding output words. Each word is then assigned a phonetic transcription, and the text is divided and labeled into prosodic units such as phrases, clauses, and sentences; the phonetic symbols (or pinyin) and the prosodic information together form the symbolic linguistic representation output by the front-end model. The front-end model is generally independent of the voice talent and, owing to these characteristics, can be used with any speaker.
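As an illustration of these front-end stages, the following is a minimal Python sketch of normalization, phonetic transcription, and prosodic segmentation. The tiny lexicon, the digit expansion table, and the punctuation-based segmentation rule are invented stand-ins for the trained word segmentation and prosody models; none of these names come from the patent.

```python
import re

# Invented stand-ins for trained models: a real front-end would use a full
# pronunciation lexicon plus learned word segmentation and prosody models.
PHONE_LEXICON = {
    "tomorrow": "T AH M AA R OW",
    "will": "W IH L",
    "be": "B IY",
    "sunny": "S AH N IY",
}
NUMBER_WORDS = {"26": "twenty six"}

def normalize(text: str) -> str:
    """Text normalization: expand digits/abbreviations into plain words."""
    for token, words in NUMBER_WORDS.items():
        text = text.replace(token, words)
    return re.sub(r"[^\w\s.,!?]", "", text.lower())

def to_symbolic(text: str) -> list:
    """Build the symbolic linguistic representation: one entry per prosodic
    unit (here naively split on punctuation), each a list of (word, phones)."""
    units = [u.strip() for u in re.split(r"[.,!?]", normalize(text)) if u.strip()]
    return [[(w, PHONE_LEXICON.get(w, "<unk>")) for w in unit.split()]
            for unit in units]

print(to_symbolic("Tomorrow will be sunny, 26 degrees."))
```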
In one embodiment, the back-end model M_BE comprises at least a timbre model and a duration model, and the back-end model M_BE is obtained by training with at least the following method:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
The back-end model M_BE, which is also commonly referred to as a synthesizer, converts the symbolic linguistic representation into sound. In one embodiment, the back-end model M_BE applicable in the computer program-based speech synthesis method herein is obtained by first recording and annotating the prosody (e.g., pitch contour or phoneme durations) of a certain speaker A's speech, and then applying a machine learning (e.g., deep learning) algorithm.
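To make the synthesizer role concrete, here is a minimal sketch in which a duration model assigns a length to each phone and a timbre model turns each phone into acoustic frames of that length. The table of phone durations, the frame rate, and the zero-filled "mel frames" are assumptions for illustration only; the patent does not prescribe a particular architecture.

```python
import numpy as np

# Toy duration model: average phone durations in seconds, as might be
# learned from annotated recordings of speaker A (values invented here).
PHONE_DURATIONS = {"S": 0.09, "AH": 0.11, "N": 0.07, "IY": 0.12}
FRAME_RATE = 100   # acoustic frames per second
N_MELS = 80        # mel-spectrogram channels

def synthesize_frames(phones):
    """Back-end sketch: map a phone sequence to a mel-spectrogram-like
    array. A trained timbre model (e.g., a deep neural network) would
    predict real frames; here we emit placeholders of the right length."""
    frames = []
    for phone in phones:
        duration = PHONE_DURATIONS.get(phone, 0.08)   # duration model
        n_frames = max(1, int(duration * FRAME_RATE))
        frames.append(np.zeros((n_frames, N_MELS)))   # timbre model stub
    return np.concatenate(frames, axis=0)

mel = synthesize_frames(["S", "AH", "N", "IY"])
print(mel.shape)   # (frames, 80); a vocoder would turn this into audio
```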
With the rapid development of technology, more and more open-source tools have become available; in addition to training the front-end and back-end models on a large corpus, commercially available, already-trained front-end and back-end models can also be purchased directly.
In addition, a voiceprint is a unique biometric characteristic of the human voice, and different speakers can be distinguished by their voiceprints. Voiceprint features include acoustic features, which generally refer to a set of acoustic descriptive parameters (e.g., vectors) extracted from a sound signal by a computer algorithm. In one embodiment, the generic voiceprint model M_VG is trained on the acoustic feature data of a large number of speakers using a feature extraction method such as a deep neural network; as long as sufficient data of good quality are provided, a good result can be expected, and the generic voiceprint model is robust and universal, covering the voiceprint space of as many people as possible.
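The following sketch illustrates what such feature extraction could look like: per-frame acoustic features are passed through a nonlinear projection and pooled over time into a fixed-size voiceprint vector. The single random projection is an invented stand-in for a trained deep neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for a trained deep network: one random projection
# from 80 acoustic features to a 128-dimensional voiceprint space.
W = rng.standard_normal((80, 128))

def voiceprint_embedding(frames: np.ndarray) -> np.ndarray:
    """Map per-frame acoustic features (n_frames, 80) to a fixed-size,
    unit-length voiceprint vector by projection plus temporal pooling."""
    hidden = np.tanh(frames @ W)        # frame-level transformation
    embedding = hidden.mean(axis=0)     # pool over time
    return embedding / np.linalg.norm(embedding)

frames = rng.standard_normal((200, 80))    # stand-in for extracted features
print(voiceprint_embedding(frames).shape)  # (128,)
```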
FIG. 3 is a schematic representation of the generation of a voiceprint model of a speaker of interest in accordance with one embodiment of the present invention.
In one embodiment, the voiceprint model M_VB of a certain speaker of interest B can be derived by updating the existing generic voiceprint model M_VG. First, a small number of sentences (e.g., 5-10 sentences) from the speaker of interest are collected, where the collection of the speech of the speaker of interest can be done automatically during use of the computer program-based speech synthesis system according to the present disclosure. For example, in a vehicle scenario, the system may be granted permission to collect a family member's speech when the driver talks on the phone with that family member (e.g., a child or wife) via the on-board Bluetooth system. Then, based on the collected speech of the at least one speaker of interest, the corresponding voiceprint model M_VB of the speaker of interest B (such as the child or wife) is generated by model-adaptive adjustment of the generic voiceprint model.
For example, reference may be made to fig. 4, which shows a simplified voiceprint model space; for ease of understanding, the multidimensional model is reduced to a two-dimensional space. Assuming the generic voiceprint model M_VG covers the entire space, an individual voiceprint model (e.g., M_VA or M_VB) may be one of its subspaces.
Model-adaptive adjustment of the generic voiceprint model may be achieved by updating at least some of its parameters to adapt to the speech of the speaker of interest. Briefly, the generic voiceprint model, as a reference model, captures during training the overall range of parameters found in a large amount of speakers' acoustic data, smoothing out the acoustic differences between individual speakers. When the acoustic features of at least one speaker of interest are added, the speaker of interest exhibits, as an individual, unique deviations from the commonalities represented by the generic voiceprint model; these deviations can be made to perturb the generic voiceprint model by adjusting the corresponding acoustic data parameters, so that a specific voiceprint model can be derived for the speaker of interest.
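The patent does not specify the adaptation algorithm; one classical way to perturb a reference model with a small amount of target-speaker data is MAP-style interpolation, sketched below. The mean-vector parameterization and the relevance factor are assumptions for illustration.

```python
import numpy as np

def adapt_voiceprint(generic_mean: np.ndarray,
                     speaker_frames: np.ndarray,
                     relevance: float = 16.0) -> np.ndarray:
    """MAP-style adaptation sketch: shift the generic model's parameters
    toward the statistics of the speaker of interest. With few samples
    the result stays close to the generic model; with more data it moves
    further toward the speaker's own statistics."""
    n = len(speaker_frames)
    alpha = n / (n + relevance)            # data-dependent weight
    speaker_mean = speaker_frames.mean(axis=0)
    return (1 - alpha) * generic_mean + alpha * speaker_mean

rng = np.random.default_rng(1)
generic = np.zeros(128)                       # generic voiceprint parameters
few_samples = rng.standard_normal((8, 128))   # e.g., 5-10 collected sentences
customized = adapt_voiceprint(generic, few_samples)
```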
In one embodiment, model-adaptively adjusting the generic voiceprint model M_VG based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest comprises at least:
adjusting, in accordance with the respective acoustic feature data of each of the at least one speaker of interest, the parameters of the generic voiceprint model M_VG that characterize the acoustic feature data, to generate a respective customized voiceprint model for each of the at least one speaker of interest.
Fig. 5 is a schematic diagram of a computer program-based speech synthesis method according to one embodiment of the invention.
In the embodiment shown in fig. 5, the input text is analyzed by the front-end model M_FE, and the result, processed by the back-end model M_BE together with the voiceprint model M_VX of the speaker of interest X obtained in the foregoing steps, generates corresponding speech having the acoustic features of the speaker of interest X.
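This combination can be read as a simple composition: the front-end output is fed to the back-end conditioned on the selected voiceprint model. A minimal sketch of that wiring, reusing the hypothetical components from the earlier sketches:

```python
class SpeechSynthesisEngine:
    """Sketch of the assembly in fig. 5: front-end analysis followed by
    back-end synthesis conditioned on a speaker's voiceprint model. The
    component interfaces are assumptions carried over from the earlier
    sketches, not the patent's API."""

    def __init__(self, front_end, back_end, voiceprint):
        self.front_end = front_end
        self.back_end = back_end
        self.voiceprint = voiceprint   # generic (M_VG) or customized (e.g., M_VX)

    def speak(self, text: str):
        symbolic = self.front_end(text)                  # analyze the text
        return self.back_end(symbolic, self.voiceprint)  # condition on the voice

# Hypothetical usage, assuming a back-end that accepts a voiceprint:
# engine_x = SpeechSynthesisEngine(to_symbolic, synthesize_with_voice, customized)
# audio = engine_x.speak("Tomorrow will be sunny, 26 degrees.")
```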
Fig. 6 is a schematic flow chart of a computer program-based speech synthesis method according to another embodiment of the invention.
The speech synthesis method in the embodiment as shown in fig. 6 includes:
generating a reference speech synthesis engine comprising at least the generic voiceprint model M_VG, the front-end model M_FE, the back-end model M_BE, and the voiceprint model M_VA of an existing speaker A;
collecting the voice of the speaker of interest B and extracting acoustic feature data;
model-adaptively adjusting the generic voiceprint model based on the acoustic feature data of the speaker of interest B;
generating a speech synthesis engine for the speaker of interest B, wherein the speech synthesis engine of the speaker of interest B comprises at least the generic voiceprint model M_VG, the front-end model M_FE, the back-end model M_BE, and the voiceprint model M_VB of the speaker of interest B; and generating speech having the voice style of speaker B using the speech synthesis engine of the speaker of interest B;
judging the quality of the generated voice:
if the generated speech quality is high, the speech synthesis engine is activated in the system; and
if the generated speech quality is not satisfactory, the method returns to the collecting step to continue collecting the voice of the speaker of interest B and extracting acoustic feature data.
For the above step of judging speech quality, the judgment method may be subjective scoring, i.e., family members or friends familiar with speaker B's voice are asked to score the synthesized audio; alternatively, the judgment method may be objective scoring, i.e., the generated speech audio of speaker B is evaluated objectively using a dedicated evaluation system.
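The collect-adapt-evaluate loop of fig. 6 can be expressed as a simple quality gate, sketched below. The `objective_score` stub stands in for whichever subjective or objective evaluation system is used, and the threshold and round limit are invented values.

```python
def objective_score(audio) -> float:
    """Stub for an evaluation system (e.g., a learned quality predictor or
    averaged listener ratings); assumed to return a score in [0, 5]."""
    return 4.2   # placeholder value

def build_engine_for_speaker(collect_samples, adapt, synthesize,
                             threshold=4.0, max_rounds=5):
    """Keep collecting speech from the speaker of interest and re-adapting
    the voiceprint model until the synthesized audio passes the gate."""
    samples = []
    for _ in range(max_rounds):
        samples.extend(collect_samples())        # gather more sentences
        voiceprint = adapt(samples)              # model adaptation step
        audio = synthesize("test sentence", voiceprint)
        if objective_score(audio) >= threshold:
            return voiceprint                    # activate engine in the system
    raise RuntimeError("speech quality not acceptable after max_rounds")
```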
In an embodiment, for different applications or speech synthesis tasks, the customized speech synthesis engine of any one of the at least one speaker of interest can be assigned to a respective application or speech synthesis task. For example, a user may wish to have messages read by a customized speech synthesis engine having the voice characteristics of their wife, child, friend, or colleague. In addition, specific voice announcements can be configured for different applications; for example, map navigation, news broadcasting, and the like can each be assigned a different speaker of interest's customized speech synthesis engine.
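Assigning engines to applications then amounts to a lookup table, as in the following sketch; the application names and engine identifiers are invented for illustration.

```python
# Hypothetical mapping from application/task to a customized engine.
ENGINE_BY_APP = {
    "navigation": "engine_speaker_a",   # e.g., a friend's voice
    "news": "engine_speaker_b",         # e.g., a family member's voice
}
DEFAULT_ENGINE = "engine_reference"

def engine_for(app: str) -> str:
    """Resolve which customized speech synthesis engine reads this
    application's content, falling back to the reference engine."""
    return ENGINE_BY_APP.get(app, DEFAULT_ENGINE)

print(engine_for("navigation"))   # engine_speaker_a
print(engine_for("weather"))      # engine_reference
```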
According to an embodiment of the present invention, as shown in fig. 7, there is provided a computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
In one embodiment, the respective customized voiceprint model of the at least one speaker of interest is generated by: collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, model-adaptively adjusting a generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest.
In one embodiment, the generic voiceprint model is based on acoustic feature extraction of voices of multiple speakers and is trained using machine learning, and the reference speech synthesis engine is generated by combining the generic voiceprint model with the front-end model and the back-end model.
In one embodiment, the customized voice generating unit is further configured to:
the customized speech synthesis engine of any of the at least one speaker of interest is configured for a respective application or speech synthesis task for the different application or speech synthesis task.
In one embodiment, the front-end model includes at least a word segmentation model and a prosody model, the front-end model being trained by at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
In one embodiment, the back-end model includes at least a timbre model and a duration model, the back-end model being trained by at least the following methods:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
According to one embodiment of the present invention, there is provided a vehicle comprising a computer program-based speech synthesis system as described in any of the above examples.
According to one embodiment of the present invention, there is provided a computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor which, when executed by the processor, instruct the processor to perform the computer program-based speech synthesis method of the present invention. The computer device may broadly be a server or any other electronic device having the necessary computing and/or processing capabilities. In one embodiment, the computer device may include a processor, memory, network interface, communication interface, etc., connected by a system bus. The processor of the computer device may be used to provide the necessary computing, processing, and/or control capabilities. The memory of the computer device may include a non-volatile storage medium and an internal memory. The non-volatile storage medium may have an operating system, computer programs, etc. stored therein or thereon. The internal memory may provide an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The network interface and communication interface of the computer device may be used to connect to and communicate with external devices via a network. The computer program, when executed by a processor, performs the steps of the computer program-based speech synthesis method of the invention.
The present invention may be implemented as a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the method of the present invention to be performed. In one embodiment, the computer program is distributed over a plurality of computer devices or processors coupled by a network such that the computer program is stored, accessed, and executed by one or more computer devices or processors in a distributed fashion. One or more method steps/operations may be performed by one or more computer devices or processors, and one or more other method steps/operations may be performed by one or more other computer devices or processors. One or more computer devices or processors may perform a single method step/operation or two or more method steps/operations.
Those of ordinary skill in the art will appreciate that all or part of the steps of the computer program-based speech synthesis method of the present invention may be implemented by a computer program, which may be stored in a non-transitory computer-readable storage medium and instruct related hardware such as a computer device or a processor to perform them; when executed, the program causes the steps of the method of the present invention to be performed. Any reference herein to memory, storage, database, or other medium may include non-volatile and/or volatile memory, as the case may be. Examples of non-volatile memory include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), flash memory, magnetic tape, floppy disks, magnetic data storage devices, optical data storage devices, hard disks, solid-state disks, and the like. Examples of volatile memory include Random Access Memory (RAM), external cache memory, and the like.
In this specification, whenever reference is made to "one embodiment," "another embodiment," "some embodiments," etc., it is intended that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment. Furthermore, when a particular feature, structure, or characteristic is described in connection with any embodiment, it is submitted that it is within the purview of one skilled in the art to effect such feature, structure, or characteristic in connection with other ones of the embodiments.
The technical features described above may be arbitrarily combined. Although not all possible combinations of features are described, any combination of features should be considered to be covered by the description provided that such combinations are not inconsistent.
While the invention has been described in connection with embodiments, those skilled in the art will appreciate that various modifications and variations are possible without departing from the spirit and scope of the invention. The scope of the invention should, therefore, be determined with reference to the appended claims.

Claims (14)

1. A computer program-based speech synthesis method, the method comprising:
acquiring a front-end model and a back-end model for speech synthesis, wherein the front-end model at least represents a model for analyzing and processing text, and the back-end model at least represents a model for representing acoustic characteristics of speech of one speaker;
acquiring a generic voiceprint model and combining the generic voiceprint model with the front-end model and the back-end model to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning;
collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, performing model adaptation on the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest;
combining the respective customized voiceprint models of the at least one speaker of interest with the front-end model and the back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine;
processing text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
2. The method of claim 1, further comprising: configuring, for different applications or speech synthesis tasks, the customized speech synthesis engine of any one of the at least one speaker of interest for a respective application or speech synthesis task.
3. The method according to claim 1 or 2, wherein the front-end model comprises at least a word segmentation model and a prosody model, the front-end model being obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
4. The method according to claim 1 or 2, wherein the back-end model comprises at least a timbre model and a duration model, the back-end model being obtained by training at least the following methods:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
5. The method of claim 1, wherein the model adapting the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest comprises at least:
parameters of the generic voiceprint model that are related to characterizing the acoustic feature data are adjusted in accordance with the respective acoustic feature data for each of the at least one speaker of interest to generate a respective customized voiceprint model for each of the at least one speaker of interest.
6. A computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
7. The system of claim 6, wherein the respective customized voiceprint model of the at least one speaker of interest is generated by: collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, model-adaptively adjusting a generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest.
8. The system of claim 7, wherein the generic voiceprint model is based on acoustic feature extraction of voices of a plurality of speakers and is trained using machine learning, the reference speech synthesis engine being generated by combining the generic voiceprint model with the front-end model and the back-end model.
9. The system of any of claims 6 to 8, wherein the customized speech generation unit is further configured to:
the customized speech synthesis engine of any of the at least one speaker of interest is configured for a respective application or speech synthesis task for the different application or speech synthesis task.
10. The system according to any one of claims 6 to 8, wherein the front-end model comprises at least a word segmentation model and a prosody model, the front-end model being obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
11. The system of any of claims 6 to 8, wherein the back-end model comprises at least a timbre model and a duration model, the back-end model being obtained by training at least:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
12. A vehicle comprising a computer program based speech synthesis system according to any of claims 6 to 11.
13. A computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor, which when executed by the processor, instruct the processor to perform the computer program-based speech synthesis method according to any one of claims 1-5.
14. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the computer program-based speech synthesis method according to any of claims 1-5 to be performed.

Priority Applications (1)

Application Number: CN202210237919.7A
Priority Date: 2022-03-11
Filing Date: 2022-03-11
Title: Speech synthesis method and system based on computer program

Publications (1)

Publication Number: CN116798400A
Publication Date: 2023-09-22

Family

ID: 88044720

Family Applications (1)

Application Number: CN202210237919.7A (CN116798400A, pending)
Priority Date: 2022-03-11
Filing Date: 2022-03-11
Title: Speech synthesis method and system based on computer program

Country Status (1)

Country: CN
Link: CN116798400A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN107492382A (en) * 2016-06-13 2017-12-19 阿里巴巴集团控股有限公司 Voiceprint extracting method and device based on neutral net
CN110858484A (en) * 2018-08-22 2020-03-03 北京航天长峰科技工业集团有限公司 Voice recognition method based on voiceprint recognition technology
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113112988A (en) * 2021-03-30 2021-07-13 上海红阵信息科技有限公司 Speech synthesis processing system and method based on AI processing

Similar Documents

Publication Publication Date Title
CN108962217B (en) Speech synthesis method and related equipment
US10891928B2 (en) Automatic song generation
KR102582291B1 (en) Emotion information-based voice synthesis method and device
CN108806655B (en) Automatic generation of songs
US9240177B2 (en) System and method for generating customized text-to-speech voices
CN101030368B (en) Method and system for communicating across channels simultaneously with emotion preservation
US8666743B2 (en) Speech recognition method for selecting a combination of list elements via a speech input
CN111587455A (en) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
JP6639285B2 (en) Voice quality preference learning device, voice quality preference learning method and program
US20090254349A1 (en) Speech synthesizer
WO2004047076A1 (en) Standard model creating device and standard model creating method
JP2004037721A (en) System and program for voice response and storage medium therefor
US11450306B2 (en) Systems and methods for generating synthesized speech responses to voice inputs by training a neural network model based on the voice input prosodic metrics and training voice inputs
US20230274727A1 (en) Instantaneous learning in text-to-speech during dialog
CN112562681B (en) Speech recognition method and apparatus, and storage medium
JP4586615B2 (en) Speech synthesis apparatus, speech synthesis method, and computer program
Obin et al. Similarity search of acted voices for automatic voice casting
CN112487248A (en) Video file label generation method and device, intelligent terminal and storage medium
JP2003330485A (en) Voice recognition device, voice recognition system, and method for voice recognition
US20030055642A1 (en) Voice recognition apparatus and method
US20040181407A1 (en) Method and system for creating speech vocabularies in an automated manner
CN116798400A (en) Speech synthesis method and system based on computer program
CN115472185A (en) Voice generation method, device, equipment and storage medium
Coto-Jiménez Measuring the effect of reverberation on statistical parametric speech synthesis
CN118016048A (en) Voice interaction method, device, computer equipment and readable storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination