CN116798400A - Speech synthesis method and system based on computer program

Speech synthesis method and system based on computer program

Info

Publication number: CN116798400A
Application number: CN202210237919.7A
Authority: CN (China)
Prior art keywords: model, speaker, speech synthesis, interest, speech
Priority date: 2022-03-11 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2022-03-11
Publication date: 2023-09-22
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 雷文辉
Current Assignee: Porsche Shanghai Digital Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Porsche Shanghai Digital Technology Co ltd
Application filed by Porsche Shanghai Digital Technology Co ltd


Abstract

A computer program-based speech synthesis method, comprising: acquiring a front-end model and a back-end model for speech synthesis; acquiring a generic voiceprint model to generate a reference speech synthesis engine; model-adaptively adjusting the generic voiceprint model based on collected acoustic feature data of at least one speaker of interest to generate a corresponding customized voiceprint model for the at least one speaker of interest; generating a corresponding customized speech synthesis engine for the at least one speaker of interest; and processing text to be read using the customized speech synthesis engine of a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker. With the method, the corresponding system, and the vehicle, user- and content-customized speech synthesis engines can be generated, providing a rich personalized experience for the user.

Description

Speech synthesis method and system based on computer program
Technical Field
The present invention relates to the field of speech synthesis, and in particular, to a computer program-based speech synthesis method, system, corresponding vehicle, computer device, and computer-readable storage medium.
Background
Speech synthesis (Text-To-Speech, TTS) is one of the most important forms of feedback when a machine interacts with a human being, as it enables input text to be automatically turned into speech by the machine.
In recent years, voice assistants have become popular in the fields of smart vehicles and smartphones. They provide speech recognition, natural language understanding, and speech synthesis, and on the basis of these technologies enrich the user experience, for example: broadcasting navigation prompts, querying the weather, searching for and playing music, reading the news, controlling the vehicle, and the like.
In the vehicle industry, suppliers of speech synthesis models provide their models to original equipment manufacturers or tier-one suppliers, so that the models can be integrated into the vehicle's electronic and electrical systems (E/E systems) and used to read out input text as speech. For example, when the driver asks the voice assistant about the weather, the assistant queries the information from the internet and responds accordingly; if the retrieved information is "tomorrow will be sunny with a temperature of 26 degrees", this information is displayed on the vehicle's screen and read out by speech synthesis. Moreover, the voice style of speech synthesis in a vehicle is fixed to the voice of the speaker chosen in the speech synthesis model provided by the vehicle manufacturer or supplier.
However, existing speech synthesis techniques have a number of drawbacks when facing the needs of a new generation of users. First, the voice style of the speaker used for speech synthesis is the same and fixed across a given vehicle model and even a given vehicle brand, with hardly any variation. Furthermore, since the data training process for speech synthesis is time-consuming and expensive, it is almost impossible to generate a unique speech synthesis engine for each user. Finally, the same voice style is used to read all kinds of content, and the user cannot select a preferred voice style for particular content.
Thus, there is a need for a new, personalized, improved solution for content-oriented and user-customized speech synthesis.
Disclosure of Invention
To improve at least one of the above problems, the present invention provides a computer program-based speech synthesis method, system, corresponding vehicle, computer device and computer-readable storage medium.
According to a first aspect of the present invention, there is provided a computer program-based speech synthesis method, the method comprising:
acquiring a front-end model and a back-end model for speech synthesis, wherein the front-end model at least represents a model for analyzing and processing text, and the back-end model at least represents a model for representing acoustic characteristics of speech of one speaker;
acquiring a generic voiceprint model and combining the generic voiceprint model with the front-end model and the back-end model to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning;
collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, performing model adaptation on the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest;
combining the respective customized voiceprint models of the at least one speaker of interest with the front-end model and the back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine;
processing text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
According to a second aspect of the present invention, there is provided a computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
According to a third aspect of the present invention there is provided a vehicle comprising a computer program based speech synthesis system as described in any of the embodiments of the second aspect above.
According to a fourth aspect of the present invention there is provided a computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor, which when executed by the processor, instruct the processor to perform a computer program-based speech synthesis method according to any of the embodiments of the first aspect described above.
According to a fifth aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the computer program-based speech synthesis method according to any of the embodiments of the first aspect described above to be performed.
The computer program-based speech synthesis method, system, vehicle, computer device, and computer-readable storage medium allow a user to obtain the various voice styles they desire on top of the original speech synthesis system and to fully express their individuality. Furthermore, since the present invention provides a very smooth and natural way to obtain a new speech synthesis system, neither the user nor the speech synthesis engine provider needs to expend much cost or effort on it. The user can also select different speech synthesis styles to read different content. In this way, the invention provides a rich personalized experience for the user at low cost and with simple operation.
Drawings
Non-limiting and non-exhaustive embodiments of the present invention are described by way of example with reference to the following drawings, wherein:
Fig. 1 is a schematic flow chart of a computer program-based speech synthesis method according to an embodiment of the invention.
Fig. 2 is a schematic diagram of a computer program-based speech synthesis method according to an embodiment of the invention.
Fig. 3 is a schematic representation of the generation of a voiceprint model of a speaker of interest in accordance with one embodiment of the present invention.
Fig. 4 is a simplified voiceprint model space diagram in accordance with one embodiment of the present invention.
Fig. 5 is a schematic diagram of a computer program-based speech synthesis method according to one embodiment of the invention.
Fig. 6 is a schematic flow chart of a computer program-based speech synthesis method according to another embodiment of the invention.
Fig. 7 is a schematic diagram of a computer program-based speech synthesis system according to one embodiment of the invention.
Detailed Description
To further clarify the above and other features and advantages of the present invention, a further description of the invention will be rendered by reference to the appended drawings. It should be understood that the specific embodiments presented herein are for purposes of explanation to those skilled in the art and are intended to be illustrative only and not limiting.
Fig. 1 schematically shows a computer program based speech synthesis method S100 according to an embodiment of the invention. The method S100 may include step S110, step S120, step S130, step S140, and step S150.
In step S110, a front-end model M_FE and a back-end model M_BE for speech synthesis are acquired, wherein the front-end model M_FE represents at least a model for analyzing and processing text, and the back-end model M_BE represents at least a model characterizing the acoustic features of one speaker's speech.
In step S120, a generic voiceprint model M_VG is acquired and combined with the front-end model M_FE and the back-end model M_BE to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning.
In step S130, corresponding speech samples of at least one speaker of interest are collected and acoustic feature data are extracted from the speech samples, and the generic voiceprint model M_VG is model-adaptively adjusted based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest.
In step S140, the respective customized voiceprint model of the at least one speaker of interest is combined with the front-end model M_FE and the back-end model M_BE to generate a respective customized speech synthesis engine for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine.
In step S150, the text to be read is processed using the customized speech synthesis engine corresponding to the speaker of interest selected by the user, to generate corresponding speech having the acoustic characteristics of the selected speaker.
Fig. 2 is a schematic diagram of a computer program-based speech synthesis method according to an embodiment of the invention.
In one embodiment, the front-end model M_FE comprises at least a word segmentation model and a prosody model, and the front-end model M_FE is obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
To obtain a better speech synthesis model, it is usually necessary to select a professional voice talent and prepare a large, high-coverage corpus; the professional speaker reads the sentences of the corpus to produce recordings, the sentences and audio are then segmented and labeled and combined as training data, and the front-end model M_FE (e.g., a hidden Markov model or a deep neural network) is obtained through a machine learning (e.g., deep learning) algorithm applicable in the computer program-based speech synthesis method herein.
Specifically, in training the front-end model M_FE, text normalization (also called preprocessing or tokenization) is first performed to convert raw text containing digits, abbreviations, and the like into the corresponding output words. Each word is then assigned a phonetic transcription, and the text is divided and labeled into prosodic units such as phrases, clauses, and sentences; the phonetic symbols (or pinyin) and the prosodic information together form the symbolic linguistic representation output by the front-end model. The front-end model is generally independent of the voice talent and, owing to these characteristics, can be used with any speaker.
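As an illustration of these front-end stages, the following is a minimal Python sketch of normalization, phonetic transcription, and prosodic segmentation. The tiny lexicon, the digit expansion table, and the punctuation-based segmentation rule are invented stand-ins for the trained word segmentation and prosody models; none of these names come from the patent.

```python
import re

# Invented stand-ins for trained models: a real front-end would use a full
# pronunciation lexicon plus learned word segmentation and prosody models.
PHONE_LEXICON = {
    "tomorrow": "T AH M AA R OW",
    "will": "W IH L",
    "be": "B IY",
    "sunny": "S AH N IY",
}
NUMBER_WORDS = {"26": "twenty six"}

def normalize(text: str) -> str:
    """Text normalization: expand digits/abbreviations into plain words."""
    for token, words in NUMBER_WORDS.items():
        text = text.replace(token, words)
    return re.sub(r"[^\w\s.,!?]", "", text.lower())

def to_symbolic(text: str) -> list:
    """Build the symbolic linguistic representation: one entry per prosodic
    unit (here naively split on punctuation), each a list of (word, phones)."""
    units = [u.strip() for u in re.split(r"[.,!?]", normalize(text)) if u.strip()]
    return [[(w, PHONE_LEXICON.get(w, "<unk>")) for w in unit.split()]
            for unit in units]

print(to_symbolic("Tomorrow will be sunny, 26 degrees."))
```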
In one embodiment, the back-end model M_BE comprises at least a timbre model and a duration model, and the back-end model M_BE is obtained by training with at least the following method:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
The back-end model M_BE, which is also commonly referred to as a synthesizer, converts the symbolic linguistic representation into sound. In one embodiment, the back-end model M_BE applicable in the computer program-based speech synthesis method herein is obtained by first recording and annotating the prosody (e.g., pitch contour or phoneme durations) of a certain speaker A's speech, and then applying a machine learning (e.g., deep learning) algorithm.
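To make the synthesizer role concrete, here is a minimal sketch in which a duration model assigns a length to each phone and a timbre model turns each phone into acoustic frames of that length. The table of phone durations, the frame rate, and the zero-filled "mel frames" are assumptions for illustration only; the patent does not prescribe a particular architecture.

```python
import numpy as np

# Toy duration model: average phone durations in seconds, as might be
# learned from annotated recordings of speaker A (values invented here).
PHONE_DURATIONS = {"S": 0.09, "AH": 0.11, "N": 0.07, "IY": 0.12}
FRAME_RATE = 100   # acoustic frames per second
N_MELS = 80        # mel-spectrogram channels

def synthesize_frames(phones):
    """Back-end sketch: map a phone sequence to a mel-spectrogram-like
    array. A trained timbre model (e.g., a deep neural network) would
    predict real frames; here we emit placeholders of the right length."""
    frames = []
    for phone in phones:
        duration = PHONE_DURATIONS.get(phone, 0.08)   # duration model
        n_frames = max(1, int(duration * FRAME_RATE))
        frames.append(np.zeros((n_frames, N_MELS)))   # timbre model stub
    return np.concatenate(frames, axis=0)

mel = synthesize_frames(["S", "AH", "N", "IY"])
print(mel.shape)   # (frames, 80); a vocoder would turn this into audio
```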
With the rapid development of technology, more and more open-source tools have become available; in addition to training the front-end and back-end models on a large corpus, commercially available, already-trained front-end and back-end models can also be purchased directly.
In addition, a voiceprint is a unique biometric characteristic of the human voice, and different speakers can be distinguished by their voiceprints. Voiceprint features include acoustic features, which generally refer to a set of acoustic descriptive parameters (e.g., vectors) extracted from a sound signal by a computer algorithm. In one embodiment, the generic voiceprint model M_VG is trained on the acoustic feature data of a large number of speakers using a feature extraction method such as a deep neural network; as long as sufficient data of good quality are provided, a good result can be expected, and the generic voiceprint model is robust and universal, covering the voiceprint space of as many people as possible.
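The following sketch illustrates what such feature extraction could look like: per-frame acoustic features are passed through a nonlinear projection and pooled over time into a fixed-size voiceprint vector. The single random projection is an invented stand-in for a trained deep neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented stand-in for a trained deep network: one random projection
# from 80 acoustic features to a 128-dimensional voiceprint space.
W = rng.standard_normal((80, 128))

def voiceprint_embedding(frames: np.ndarray) -> np.ndarray:
    """Map per-frame acoustic features (n_frames, 80) to a fixed-size,
    unit-length voiceprint vector by projection plus temporal pooling."""
    hidden = np.tanh(frames @ W)        # frame-level transformation
    embedding = hidden.mean(axis=0)     # pool over time
    return embedding / np.linalg.norm(embedding)

frames = rng.standard_normal((200, 80))    # stand-in for extracted features
print(voiceprint_embedding(frames).shape)  # (128,)
```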
FIG. 3 is a schematic representation of the generation of a voiceprint model of a speaker of interest in accordance with one embodiment of the present invention.
In one embodiment, the voiceprint model M_VB of a certain speaker of interest B can be derived by updating the existing generic voiceprint model M_VG. First, a small number of sentences (e.g., 5-10 sentences) from the speaker of interest are collected, where the collection of the speech of the speaker of interest can be done automatically during use of the computer program-based speech synthesis system according to the present disclosure. For example, in a vehicle scenario, the system may be granted permission to collect a family member's speech when the driver talks on the phone with that family member (e.g., a child or wife) via the on-board Bluetooth system. Then, based on the collected speech of the at least one speaker of interest, the corresponding voiceprint model M_VB of the speaker of interest B (such as the child or wife) is generated by model-adaptive adjustment of the generic voiceprint model.
For example, reference may be made to fig. 4, which shows a simplified voiceprint model space; for ease of understanding, the multidimensional model is reduced to a two-dimensional space. Assuming the generic voiceprint model M_VG covers the entire space, an individual voiceprint model (e.g., M_VA or M_VB) may be one of its subspaces.
Model-adaptive adjustment of the generic voiceprint model may be achieved by updating at least some of its parameters to adapt to the speech of the speaker of interest. Briefly, the generic voiceprint model, as a reference model, captures during training the overall range of parameters found in a large amount of speakers' acoustic data, smoothing out the acoustic differences between individual speakers. When the acoustic features of at least one speaker of interest are added, the speaker of interest exhibits, as an individual, unique deviations from the commonalities represented by the generic voiceprint model; these deviations can be made to perturb the generic voiceprint model by adjusting the corresponding acoustic data parameters, so that a specific voiceprint model can be derived for the speaker of interest.
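The patent does not specify the adaptation algorithm; one classical way to perturb a reference model with a small amount of target-speaker data is MAP-style interpolation, sketched below. The mean-vector parameterization and the relevance factor are assumptions for illustration.

```python
import numpy as np

def adapt_voiceprint(generic_mean: np.ndarray,
                     speaker_frames: np.ndarray,
                     relevance: float = 16.0) -> np.ndarray:
    """MAP-style adaptation sketch: shift the generic model's parameters
    toward the statistics of the speaker of interest. With few samples
    the result stays close to the generic model; with more data it moves
    further toward the speaker's own statistics."""
    n = len(speaker_frames)
    alpha = n / (n + relevance)            # data-dependent weight
    speaker_mean = speaker_frames.mean(axis=0)
    return (1 - alpha) * generic_mean + alpha * speaker_mean

rng = np.random.default_rng(1)
generic = np.zeros(128)                       # generic voiceprint parameters
few_samples = rng.standard_normal((8, 128))   # e.g., 5-10 collected sentences
customized = adapt_voiceprint(generic, few_samples)
```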
In one embodiment, model-adaptively adjusting the generic voiceprint model M_VG based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest comprises at least:
adjusting, in accordance with the respective acoustic feature data of each of the at least one speaker of interest, the parameters of the generic voiceprint model M_VG that characterize the acoustic feature data, to generate a respective customized voiceprint model for each of the at least one speaker of interest.
Fig. 5 is a schematic diagram of a computer program-based speech synthesis method according to one embodiment of the invention.
In the embodiment shown in fig. 5, the input text is analyzed by the front-end model M_FE, and the result, processed by the back-end model M_BE together with the voiceprint model M_VX of the speaker of interest X obtained in the foregoing steps, generates corresponding speech having the acoustic features of the speaker of interest X.
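This combination can be read as a simple composition: the front-end output is fed to the back-end conditioned on the selected voiceprint model. A minimal sketch of that wiring, reusing the hypothetical components from the earlier sketches:

```python
class SpeechSynthesisEngine:
    """Sketch of the assembly in fig. 5: front-end analysis followed by
    back-end synthesis conditioned on a speaker's voiceprint model. The
    component interfaces are assumptions carried over from the earlier
    sketches, not the patent's API."""

    def __init__(self, front_end, back_end, voiceprint):
        self.front_end = front_end
        self.back_end = back_end
        self.voiceprint = voiceprint   # generic (M_VG) or customized (e.g., M_VX)

    def speak(self, text: str):
        symbolic = self.front_end(text)                  # analyze the text
        return self.back_end(symbolic, self.voiceprint)  # condition on the voice

# Hypothetical usage, assuming a back-end that accepts a voiceprint:
# engine_x = SpeechSynthesisEngine(to_symbolic, synthesize_with_voice, customized)
# audio = engine_x.speak("Tomorrow will be sunny, 26 degrees.")
```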
Fig. 6 is a schematic flow chart of a computer program-based speech synthesis method according to another embodiment of the invention.
The speech synthesis method in the embodiment as shown in fig. 6 includes:
generating a reference speech synthesis engine comprising at least the generic voiceprint model M_VG, the front-end model M_FE, the back-end model M_BE, and the voiceprint model M_VA of an existing speaker A;
collecting the voice of the speaker of interest B and extracting acoustic feature data;
model-adaptively adjusting the generic voiceprint model based on the acoustic feature data of the speaker of interest B;
generating a speech synthesis engine for the speaker of interest B, wherein the speech synthesis engine of the speaker of interest B comprises at least the generic voiceprint model M_VG, the front-end model M_FE, the back-end model M_BE, and the voiceprint model M_VB of the speaker of interest B; and generating speech having the voice style of speaker B using the speech synthesis engine of the speaker of interest B;
judging the quality of the generated voice:
if the generated speech quality is high, the speech synthesis engine is activated in the system; and
if the generated speech quality is not satisfactory, the method returns to the collecting step to continue collecting the voice of the speaker of interest B and extracting acoustic feature data.
For the above step of judging speech quality, the judgment method may be subjective scoring, i.e., family members or friends familiar with speaker B's voice are asked to score the synthesized audio; alternatively, the judgment method may be objective scoring, i.e., the generated speech audio of speaker B is evaluated objectively using a dedicated evaluation system.
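The collect-adapt-evaluate loop of fig. 6 can be expressed as a simple quality gate, sketched below. The `objective_score` stub stands in for whichever subjective or objective evaluation system is used, and the threshold and round limit are invented values.

```python
def objective_score(audio) -> float:
    """Stub for an evaluation system (e.g., a learned quality predictor or
    averaged listener ratings); assumed to return a score in [0, 5]."""
    return 4.2   # placeholder value

def build_engine_for_speaker(collect_samples, adapt, synthesize,
                             threshold=4.0, max_rounds=5):
    """Keep collecting speech from the speaker of interest and re-adapting
    the voiceprint model until the synthesized audio passes the gate."""
    samples = []
    for _ in range(max_rounds):
        samples.extend(collect_samples())        # gather more sentences
        voiceprint = adapt(samples)              # model adaptation step
        audio = synthesize("test sentence", voiceprint)
        if objective_score(audio) >= threshold:
            return voiceprint                    # activate engine in the system
    raise RuntimeError("speech quality not acceptable after max_rounds")
```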
In an embodiment, for different applications or speech synthesis tasks, the customized speech synthesis engine of any one of the at least one speaker of interest can be assigned to a respective application or speech synthesis task. For example, a user may wish to have messages read by a customized speech synthesis engine having the voice characteristics of their wife, child, friend, or colleague. In addition, specific voice announcements can be configured for different applications; for example, map navigation, news broadcasting, and the like can each be assigned a different speaker of interest's customized speech synthesis engine.
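Assigning engines to applications then amounts to a lookup table, as in the following sketch; the application names and engine identifiers are invented for illustration.

```python
# Hypothetical mapping from application/task to a customized engine.
ENGINE_BY_APP = {
    "navigation": "engine_speaker_a",   # e.g., a friend's voice
    "news": "engine_speaker_b",         # e.g., a family member's voice
}
DEFAULT_ENGINE = "engine_reference"

def engine_for(app: str) -> str:
    """Resolve which customized speech synthesis engine reads this
    application's content, falling back to the reference engine."""
    return ENGINE_BY_APP.get(app, DEFAULT_ENGINE)

print(engine_for("navigation"))   # engine_speaker_a
print(engine_for("weather"))      # engine_reference
```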
According to an embodiment of the present invention, as shown in fig. 7, there is provided a computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
In one embodiment, the respective customized voiceprint model of the at least one speaker of interest is generated by: collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, model-adaptively adjusting a generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest.
In one embodiment, the generic voiceprint model is based on acoustic feature extraction of voices of multiple speakers and is trained using machine learning, and the reference speech synthesis engine is generated by combining the generic voiceprint model with the front-end model and the back-end model.
In one embodiment, the customized voice generating unit is further configured to:
the customized speech synthesis engine of any of the at least one speaker of interest is configured for a respective application or speech synthesis task for the different application or speech synthesis task.
In one embodiment, the front-end model includes at least a word segmentation model and a prosody model, the front-end model being trained by at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
In one embodiment, the back-end model includes at least a timbre model and a duration model, the back-end model being trained by at least the following methods:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
According to one embodiment of the present invention, there is provided a vehicle comprising a computer program-based speech synthesis system as described in any of the above examples.
According to one embodiment of the present invention, there is provided a computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor which, when executed by the processor, instruct the processor to perform the computer program-based speech synthesis method of the present invention. The computer device may broadly be a server or any other electronic device having the necessary computing and/or processing capabilities. In one embodiment, the computer device may include a processor, memory, network interface, communication interface, etc., connected by a system bus. The processor of the computer device may be used to provide the necessary computing, processing, and/or control capabilities. The memory of the computer device may include a non-volatile storage medium and an internal memory. The non-volatile storage medium may have an operating system, computer programs, etc. stored therein or thereon. The internal memory may provide an environment for the operation of the operating system and the computer programs in the non-volatile storage medium. The network interface and communication interface of the computer device may be used to connect to and communicate with external devices via a network. The computer program, when executed by a processor, performs the steps of the computer program-based speech synthesis method of the invention.
The present invention may be implemented as a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the method of the present invention to be performed. In one embodiment, the computer program is distributed over a plurality of computer devices or processors coupled by a network such that the computer program is stored, accessed, and executed by one or more computer devices or processors in a distributed fashion. One or more method steps/operations may be performed by one or more computer devices or processors, and one or more other method steps/operations may be performed by one or more other computer devices or processors. One or more computer devices or processors may perform a single method step/operation or two or more method steps/operations.
Those of ordinary skill in the art will appreciate that all or part of the steps of the computer program-based speech synthesis method of the present invention may be implemented by a computer program, which may be stored in a non-transitory computer-readable storage medium and instruct related hardware such as a computer device or a processor to perform them; when executed, the program causes the steps of the method of the present invention to be performed. Any reference herein to memory, storage, database, or other medium may include non-volatile and/or volatile memory, as the case may be. Examples of non-volatile memory include Read-Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), flash memory, magnetic tape, floppy disks, magnetic data storage devices, optical data storage devices, hard disks, solid-state disks, and the like. Examples of volatile memory include Random Access Memory (RAM), external cache memory, and the like.
In this specification, whenever reference is made to "one embodiment," "another embodiment," "some embodiments," etc., it is intended that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment. Furthermore, when a particular feature, structure, or characteristic is described in connection with any embodiment, it is submitted that it is within the purview of one skilled in the art to effect such feature, structure, or characteristic in connection with other ones of the embodiments.
The technical features described above may be arbitrarily combined. Although not all possible combinations of features are described, any combination of features should be considered to be covered by the description provided that such combinations are not inconsistent.
While the invention has been described in connection with embodiments, those skilled in the art will appreciate that various modifications and variations are possible without departing from the spirit and scope of the invention. The scope of the invention should, therefore, be determined with reference to the appended claims.

Claims (14)

1. A computer program-based speech synthesis method, the method comprising:
acquiring a front-end model and a back-end model for speech synthesis, wherein the front-end model at least represents a model for analyzing and processing text, and the back-end model at least represents a model for representing acoustic characteristics of speech of one speaker;
acquiring a generic voiceprint model and combining the generic voiceprint model with the front-end model and the back-end model to generate a reference speech synthesis engine, wherein the generic voiceprint model is obtained by extracting acoustic features from the voices of a plurality of speakers and training with machine learning;
collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, performing model adaptation on the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest;
combining the respective customized voiceprint models of the at least one speaker of interest with the front-end model and the back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on an adjustment of the reference speech synthesis engine;
processing text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
2. The method of claim 1, further comprising: configuring, for different applications or speech synthesis tasks, the customized speech synthesis engine of any one of the at least one speaker of interest for a respective application or speech synthesis task.
3. The method according to claim 1 or 2, wherein the front-end model comprises at least a word segmentation model and a prosody model, the front-end model being obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
4. The method according to claim 1 or 2, wherein the back-end model comprises at least a timbre model and a duration model, the back-end model being obtained by training at least the following methods:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
5. The method of claim 1, wherein the model adapting the generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate a respective customized voiceprint model for the at least one speaker of interest comprises at least:
parameters of the generic voiceprint model that are related to characterizing the acoustic feature data are adjusted in accordance with the respective acoustic feature data for each of the at least one speaker of interest to generate a respective customized voiceprint model for each of the at least one speaker of interest.
6. A computer program-based speech synthesis system, the system comprising:
a customized speech synthesis engine generation unit configured to combine respective customized voiceprint models of at least one speaker of interest with a front-end model and a back-end model to generate respective customized speech synthesis engines for the at least one speaker of interest based on adjustments to a reference speech synthesis engine, wherein the front-end model represents at least a model that performs an analysis process on text and the back-end model represents at least a model that characterizes acoustic features of speech of one speaker;
a customized speech generating unit configured to process text to be read using the customized speech synthesis engine corresponding to a user-selected one of the at least one speaker of interest to generate corresponding speech having the acoustic characteristics of the selected speaker of interest.
7. The system of claim 6, wherein the respective customized voiceprint model of the at least one speaker of interest is generated by: collecting respective voice samples of at least one speaker of interest and extracting acoustic feature data in the voice samples, model-adaptively adjusting a generic voiceprint model based on the acoustic feature data of the at least one speaker of interest to generate respective customized voiceprint models for the at least one speaker of interest.
8. The system of claim 7, wherein the generic voiceprint model is based on acoustic feature extraction of voices of a plurality of speakers and is trained using machine learning, the reference speech synthesis engine being generated by combining the generic voiceprint model with the front-end model and the back-end model.
9. The system of any of claims 6 to 8, wherein the customized speech generation unit is further configured to:
the customized speech synthesis engine of any of the at least one speaker of interest is configured for a respective application or speech synthesis task for the different application or speech synthesis task.
10. The system according to any one of claims 6 to 8, wherein the front-end model comprises at least a word segmentation model and a prosody model, the front-end model being obtained by training with at least the following method:
acquiring a text corpus;
identifying text data in the text corpus; and
the text data is analyzed by a machine learning method to at least train a word segment model and a prosody model for analyzing the text data.
11. The system of any of claims 6 to 8, wherein the back-end model comprises at least a timbre model and a duration model, the back-end model being obtained by training at least:
selecting a speaker and identifying a speech sample of the speaker;
extracting acoustic feature data of the speaker from the identified speech samples of the speaker;
the acoustic feature data of the speaker is analyzed by a machine learning method to at least train to generate a timbre model and a duration model for speech synthesis based on the speaker.
12. A vehicle comprising a computer program based speech synthesis system according to any of claims 6 to 11.
13. A computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable by the processor, which when executed by the processor, instruct the processor to perform the computer program-based speech synthesis method according to any one of claims 1-5.
14. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the computer program-based speech synthesis method according to any of claims 1-5 to be performed.

Priority Applications (1)

Application Number: CN202210237919.7A
Priority Date: 2022-03-11
Filing Date: 2022-03-11
Title: Speech synthesis method and system based on computer program

Publications (1)

Publication Number: CN116798400A
Publication Date: 2023-09-22

Family

ID: 88044720

Family Applications (1)

Application Number: CN202210237919.7A (CN116798400A, pending)
Priority Date: 2022-03-11
Filing Date: 2022-03-11
Title: Speech synthesis method and system based on computer program

Country Status (1)

Country: CN
Link: CN116798400A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104143326A (en) * 2013-12-03 2014-11-12 腾讯科技(深圳)有限公司 Voice command recognition method and device
CN107492382A (en) * 2016-06-13 2017-12-19 阿里巴巴集团控股有限公司 Voiceprint extracting method and device based on neutral net
CN110858484A (en) * 2018-08-22 2020-03-03 北京航天长峰科技工业集团有限公司 Voice recognition method based on voiceprint recognition technology
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set
CN111508469A (en) * 2020-04-26 2020-08-07 北京声智科技有限公司 Text-to-speech conversion method and device
CN112750419A (en) * 2020-12-31 2021-05-04 科大讯飞股份有限公司 Voice synthesis method and device, electronic equipment and storage medium
CN113112988A (en) * 2021-03-30 2021-07-13 上海红阵信息科技有限公司 Speech synthesis processing system and method based on AI processing

Similar Documents

Publication Publication Date Title
CN108962217B (en) Speech synthesis method and related equipment
US10891928B2 (en) Automatic song generation
KR102582291B1 (en) Emotion information-based voice synthesis method and device
CN108806655B (en) Automatic generation of songs
US9240177B2 (en) System and method for generating customized text-to-speech voices
CN101030368B (en) Method and system for communicating across channels simultaneously with emotion preservation
US8666743B2 (en) Speech recognition method for selecting a combination of list elements via a speech input
CN111587455A (en) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
JP6639285B2 (en) Voice quality preference learning device, voice quality preference learning method and program
US20090254349A1 (en) Speech synthesizer
WO2004047076A1 (en) Standard model creating device and standard model creating method
JP2004037721A (en) System and program for voice response and storage medium therefor
US11450306B2 (en) Systems and methods for generating synthesized speech responses to voice inputs by training a neural network model based on the voice input prosodic metrics and training voice inputs
US20230274727A1 (en) Instantaneous learning in text-to-speech during dialog
CN112562681B (en) Speech recognition method and apparatus, and storage medium
JP4586615B2 (en) Speech synthesis apparatus, speech synthesis method, and computer program
Obin et al. Similarity search of acted voices for automatic voice casting
CN112487248A (en) Video file label generation method and device, intelligent terminal and storage medium
JP2003330485A (en) Voice recognition device, voice recognition system, and method for voice recognition
US20030055642A1 (en) Voice recognition apparatus and method
US20040181407A1 (en) Method and system for creating speech vocabularies in an automated manner
CN116798400A (en) Speech synthesis method and system based on computer program
CN115472185A (en) Voice generation method, device, equipment and storage medium
Coto-Jiménez Measuring the effect of reverberation on statistical parametric speech synthesis
CN118016048A (en) Voice interaction method, device, computer equipment and readable storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination