CN110379407B - Adaptive speech synthesis method, device, readable storage medium and computing equipment - Google Patents

Info

Publication number
CN110379407B
CN110379407B
Authority
CN
China
Prior art keywords
voice
speaker
voice data
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910661648.6A
Other languages
Chinese (zh)
Other versions
CN110379407A (en)
Inventor
殷昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd filed Critical Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910661648.6A priority Critical patent/CN110379407B/en
Publication of CN110379407A publication Critical patent/CN110379407A/en
Application granted granted Critical
Publication of CN110379407B publication Critical patent/CN110379407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks

Abstract

The disclosed embodiments provide an adaptive speech synthesis method, apparatus, readable storage medium, and computing device, used to synthesize natural-sounding speaker voice when only a small amount of low-quality voice data is available. The method comprises the following steps: acquiring basic voice data and text data corresponding to the basic voice data; training a basic voice model according to the basic voice data and the corresponding text data; acquiring voice data of a speaker and text data corresponding to the voice data of the speaker; training a GRU voice model according to the voice data of the speaker, the corresponding text data, and the basic voice model; and, when a voice synthesis instruction is received, synthesizing the voice of the speaker according to the GRU voice model and the text information contained in the instruction.

Description

Adaptive speech synthesis method, device, readable storage medium and computing equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a method and an apparatus for adaptive speech synthesis, a readable storage medium, and a computing device.
Background
Speech synthesis refers to a technique in which a computer automatically generates the corresponding speech from text. Current speech synthesis systems require large amounts of high-quality data (recorded with professional recording equipment), and collecting such data consumes considerable labor and money. In addition, each time a new speaker is added, a new batch of data must be recorded in a studio.
How to synthesize natural-sounding speaker voice when only a small amount of low-quality voice data is available is a technical problem that urgently needs to be solved.
Disclosure of Invention
To this end, the present disclosure provides an adaptive speech synthesis method, apparatus, readable storage medium and computing device in an effort to solve or at least mitigate at least one of the problems identified above.
According to an aspect of the embodiments of the present disclosure, there is provided an adaptive speech synthesis method, including:
acquiring basic voice data and text data corresponding to the basic voice data;
training a basic voice model according to the basic voice data and the text data corresponding to the basic voice data;
acquiring voice data of a speaker and text data corresponding to the voice data of the speaker;
training a GRU voice model according to the voice data of the speaker, the text data corresponding to the voice data of the speaker and the basic voice model;
when a voice synthesis instruction is received, synthesizing the voice of the speaker according to the GRU voice model and the text information contained in the instruction.
Optionally, training the GRU speech model according to the speech data of the speaker and the text data corresponding to the speech data of the speaker, and the basic speech model, includes:
determining acoustic characteristics and phoneme characteristics of the speaker according to the voice data of the speaker and text data corresponding to the voice data of the speaker;
and training the GRU voice model according to the acoustic characteristics and the phoneme characteristics of the speaker and the basic voice model.
Optionally, determining the phoneme characteristics of the speaker comprises:
processing the voice data of the speaker and the text data corresponding to the voice data of the speaker according to a preset voice recognition model to obtain duration alignment information of phonemes of the speaker;
and determining the phoneme characteristics of the speaker according to the duration alignment information of the phonemes of the speaker and the text data corresponding to the voice data of the speaker.
Optionally, before determining the acoustic feature and the phoneme feature of the speaker, the method further includes:
the voice data of the speaker is preprocessed.
Optionally, the pre-processing comprises:
noise reduction and/or dereverberation.
Optionally, the method further comprises:
and verifying that the voice data of the speaker corresponds to the text data corresponding to the voice data of the speaker one to one.
Optionally, training the GRU speech model based on the acoustic and phoneme characteristics of the speaker and the base speech model, comprises:
initializing a GRU voice model according to a basic voice model;
and processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model.
According to still another aspect of the embodiments of the present disclosure, there is provided an adaptive speech synthesis apparatus including:
the basic voice data acquisition unit is used for acquiring basic voice data and text data corresponding to the basic voice data;
the basic voice model training unit is used for training a basic voice model according to the basic voice data and the text data corresponding to the basic voice data;
the speaker voice data acquisition unit is used for acquiring voice data of a speaker and text data corresponding to the voice data of the speaker;
the GRU model training unit is used for training a GRU voice model according to the voice data of the speaker, the text data corresponding to the voice data of the speaker and the basic voice model;
and the adaptive voice synthesis unit is used for synthesizing, when a voice synthesis instruction is received, the voice of the speaker according to the GRU voice model and the text information contained in the instruction.
Optionally, the GRU model training unit is specifically configured to determine an acoustic feature and a phoneme feature of the speaker according to the voice data of the speaker and text data corresponding to the voice data of the speaker;
the GRU speech model is trained based on the acoustic and phoneme characteristics of the speaker and the underlying speech model.
Optionally, when the GRU model training unit is configured to determine a phoneme feature of the speaker, the GRU model training unit is specifically configured to:
processing the voice data of the speaker and the text data corresponding to the voice data of the speaker according to a preset voice recognition model to obtain duration alignment information of phonemes of the speaker;
and determining the phoneme characteristics of the speaker according to the duration alignment information of the phoneme of the speaker and the text data corresponding to the voice data of the speaker.
Optionally, the GRU model training unit is further configured to:
the speech data of the speaker is preprocessed before determining the acoustic and phoneme characteristics of the speaker.
Optionally, the pre-processing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit is further configured to:
and verifying that the voice data of the speaker corresponds to the text data corresponding to the voice data of the speaker one to one.
Optionally, the GRU model training unit is configured to, when training the GRU speech model according to the acoustic feature and the phoneme feature of the speaker and the basic speech model, specifically:
initializing a GRU voice model according to a basic voice model;
and processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model.
According to yet another aspect of an embodiment of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the operations included in the above-mentioned method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform operations included in the above-described methods.
According to the technical solution provided by the embodiments of the present disclosure, a basic voice model is first trained on large-scale basic voice data, and a GRU voice model is then trained on a small sample of speaker voice data; the GRU model memorizes the context information of the text well. The technical solution provided by the embodiments of the present disclosure can therefore synthesize natural-sounding speaker voice after recording only a small amount of low-quality voice data from the speaker.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flow diagram of a method of adaptive speech synthesis according to an embodiment of the present disclosure;
FIG. 3 is a block diagram of a small sample based adaptive speech synthesis system according to an embodiment of the present disclosure;
fig. 4 is a block diagram of an adaptive speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 is a block diagram of an example computing device 100 arranged to implement an adaptive speech synthesis method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be configured to execute on the operating system, with program data 124, by the one or more processors 104.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media, such as a wired network or dedicated wired connection, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
Among other things, one or more programs 122 of computing device 100 include instructions for performing an adaptive speech synthesis method according to the present disclosure.
Fig. 2 illustrates a flow diagram of an adaptive speech synthesis method 200 according to an embodiment of the present disclosure, the adaptive speech synthesis method 200 starting at step S210.
Step S210, basic voice data and text data corresponding to the basic voice data are obtained.
Step S220, training a basic voice model according to the basic voice data and the text data corresponding to the basic voice data.
Step S230, acquiring voice data of a speaker and text data corresponding to the voice data of the speaker.
Step S240, training a GRU voice model according to the voice data of the speaker, the text data corresponding to the voice data of the speaker, and the basic voice model.
Step S250, when a voice synthesis instruction is received, synthesizing the voice of the speaker according to the GRU voice model and the text information contained in the instruction.
The basic voice data refers to existing large-scale voice data. The method first trains a basic voice model according to the large-scale basic voice data and its corresponding text data, and then performs adaptive training based on a small amount of voice data of the speaker and the corresponding text data.
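The two-stage scheme can be sketched as follows; this is a minimal Python outline in which every function is a hypothetical stub standing in for real training, not the patent's implementation:

```python
def train_base_model(base_voice, base_text):
    # Stage 1: train a base voice model on large-scale base data (stub).
    return {"kind": "base", "utterances": len(base_voice)}

def train_gru_model(spk_voice, spk_text, base_model):
    # Stage 2: adapt a GRU voice model to the speaker, initialised
    # from the base model (stub).
    return {"kind": "gru", "init_from": base_model["kind"]}

def synthesize(gru_model, text):
    # Stage 3: on a synthesis instruction, produce the speaker's voice
    # for the given text (stub).
    return f"audio<{text}>"

base_model = train_base_model(["b1.wav", "b2.wav"], ["t1", "t2"])
gru_model = train_gru_model(["s1.wav"], ["hello"], base_model)
audio = synthesize(gru_model, "hello world")
```

The point of the structure is only the ordering: the speaker model never trains from scratch, it always starts from the base model.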
In step S240, a gated recurrent unit (GRU) model is used. The GRU model is a variant of the recurrent neural network that can learn long-term dependencies in a time series. For speech synthesis, the model memorizes the context information of the text well; in addition, the GRU model has a simple structure and a high decoding speed, and can meet the real-time requirements of online speech synthesis.
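As a reminder of what the GRU computes, one time step of a standard GRU cell can be written as below. This is the generic textbook formulation in NumPy (biases omitted, weight shapes illustrative), not the patent's code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wn, Un):
    """One GRU time step: the gates decide how much of the previous
    hidden state to keep, which is what lets the model carry text
    context forward over long sequences."""
    z = sigmoid(x @ Wz + h @ Uz)          # update gate
    r = sigmoid(x @ Wr + h @ Ur)          # reset gate
    n = np.tanh(x @ Wn + (r * h) @ Un)    # candidate state
    return (1.0 - z) * h + z * n          # new hidden state

rng = np.random.default_rng(0)
d_in, d_h = 4, 3
x = rng.standard_normal(d_in)
h = np.zeros(d_h)
params = [rng.standard_normal(s) for s in [(d_in, d_h), (d_h, d_h)] * 3]
h1 = gru_step(x, h, *params)
```

Because a GRU step is just a handful of matrix products, decoding is cheap, which is the basis of the real-time claim above.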
Optionally, step S240 includes:
step S241, determining acoustic characteristics and phoneme characteristics of the speaker according to the voice data of the speaker and text data corresponding to the voice data of the speaker;
and step S242, training a gating cycle unit GRU voice model according to the acoustic characteristics and the phoneme characteristics of the speaker and the basic voice model.
The acoustic characteristics can be extracted directly from the voice data and may include spectral features, fundamental-frequency features, and interpolation features. The interpolation feature specifically means that, for a point where the fundamental frequency is 0, the average of the preceding and following sampling points is taken as the value of that point, which solves the problem of an unsmooth fundamental frequency.
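A minimal sketch of that interpolation feature, assuming an F0 contour in which 0 marks unvoiced frames; here each interior zero is replaced by the average of the nearest non-zero neighbours on each side, and edge zeros copy the single nearest voiced value (one plausible reading of the description, not the patent's exact algorithm):

```python
import numpy as np

def interpolate_f0(f0):
    """Fill unvoiced (zero) F0 points so the contour is smooth: each zero
    takes the average of the nearest non-zero neighbours on both sides,
    or copies the single neighbour at the edges."""
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = np.nonzero(f0)[0]
    if voiced.size == 0:
        return f0
    for i in np.where(f0 == 0)[0]:
        prev = voiced[voiced < i]
        nxt = voiced[voiced > i]
        if prev.size and nxt.size:
            f0[i] = (f0[prev[-1]] + f0[nxt[0]]) / 2.0
        else:
            f0[i] = f0[prev[-1]] if prev.size else f0[nxt[0]]
    return f0

smooth = interpolate_f0([100.0, 0.0, 0.0, 120.0])  # zeros filled to 110.0
```

The result has no zero gaps, so the fundamental-frequency feature fed to the model is continuous.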
Determining a phoneme characteristic of the speaker, comprising:
processing the voice data of the speaker and the text data corresponding to the voice data of the speaker according to a preset voice recognition model to obtain duration alignment information of phonemes of the speaker;
and determining the phoneme characteristics of the speaker according to the duration alignment information of the phoneme of the speaker and the text data corresponding to the voice data of the speaker.
The phoneme characteristics refer to part-of-speech features, context information, and tone features expressed according to the phoneme duration information. The part-of-speech features include verbs, nouns, and emotional words; the context information combines a word with the words before and after it in various ways to generate a corresponding dictionary that can be used for training.
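A toy illustration of the context-combination idea; the exact combinations are not specified in the patent, so the helper below (the word itself plus pairings with its left and right neighbours, used as dictionary entries) is hypothetical:

```python
def context_features(words, i):
    """Hypothetical context combinations for the word at index i: the
    word itself plus pairings with its left and right neighbours.
    Each returned string would become a dictionary entry for training."""
    feats = [words[i]]
    if i > 0:
        feats.append(words[i - 1] + "_" + words[i])
    if i < len(words) - 1:
        feats.append(words[i] + "_" + words[i + 1])
    return feats

feats = context_features(["I", "love", "speech"], 1)
```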
The preset speech recognition model may be a hidden Markov model (HMM), trained with the segmented text as input and with characteristics such as the pronunciation spectrum and pronunciation duration of each word as output.
Optionally, before determining the acoustic feature and the phoneme feature of the speaker, the method further includes:
and preprocessing the voice data of the speaker.
The preprocessing includes, but is not limited to, noise reduction and dereverberation, so that the low-quality voice data input by the user is processed into clearer voice data.
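The patent does not name a specific denoising algorithm; as one common choice, a toy spectral-subtraction denoiser could look like the sketch below, where the noise magnitude is estimated from the first few frames (assumed to be speech-free) and subtracted from every frame's spectrum:

```python
import numpy as np

def spectral_subtract(signal, noise_frames=5, frame_len=256):
    """Toy spectral subtraction: estimate a noise magnitude spectrum
    from the first `noise_frames` frames, subtract it from each frame,
    and resynthesize with the original phase."""
    n = len(signal) // frame_len * frame_len
    frames = np.asarray(signal[:n], dtype=float).reshape(-1, frame_len)
    spectra = np.fft.rfft(frames, axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
    mag = np.maximum(np.abs(spectra) - noise_mag, 0.0)  # floor at zero
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)),
                         n=frame_len, axis=1)
    return clean.reshape(-1)

rng = np.random.default_rng(0)
noisy = rng.standard_normal(2048)   # stand-in for a noisy recording
clean = spectral_subtract(noisy)
```

A production system would use overlapping windows and a more careful noise estimate; this only illustrates the shape of the operation.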
Optionally, before the voice data of the speaker and the corresponding text data are processed according to the preset voice recognition model, it is verified that the voice data and the text data correspond one-to-one, avoiding the situation in which training cannot proceed normally because of mismatched input data.
Optionally, in step S242, training the GRU speech model according to the acoustic feature and the phoneme feature of the speaker and the basic speech model includes:
initializing a GRU voice model according to a basic voice model;
and processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model.
The GRU voice model is trained with the phoneme characteristics as input and the acoustic characteristics as output, so that the voice synthesized by the trained GRU voice model matches the voice characteristics of the speaker.
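The adaptation step can be illustrated with a toy stand-in: initialise the weights from the base model, then iterate gradient descent on the small speaker set. A linear least-squares model is used here as a proxy for the real GRU training, and all names are hypothetical:

```python
import numpy as np

def fine_tune(base_w, X, y, lr=0.1, epochs=500):
    """Initialise from the base model's weights, then iteratively train
    on the small speaker dataset (phoneme features in, acoustics out)."""
    w = base_w.copy()                        # weights copied from base model
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)    # least-squares gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 4))             # stand-in "phoneme features"
y = X @ np.array([1.0, -2.0, 0.5, 3.0])      # stand-in "acoustic targets"
base_w = np.zeros(4)                         # pretend base-model weights
w = fine_tune(base_w, X, y)
loss = np.mean((X @ w - y) ** 2)
```

Only the pattern carries over to the real system: no random re-initialisation, just continued training of inherited weights on the new speaker's data.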
A specific embodiment of the present disclosure is given below in conjunction with fig. 3.
Fig. 3 is a block diagram of a structure of an adaptive speech synthesis system based on small samples, which specifically includes:
1. a data inspection module;
the module is mainly used for detecting whether the voice data and the text data input by the user are matched or not, and if not, informing the user of inputting again.
2. An automatic alignment module;
and recognizing the voice of the user by using a built-in voice recognition model, and matching the voice with corresponding text information to obtain the duration alignment information of each phoneme, namely the duration of pronunciation of each phoneme and the corresponding frequency spectrum.
3. A preprocessing module;
Since the user's recording equipment is generally crude and the quality of the voice data is generally not high, the system performs a series of preprocessing operations on the voice data in this module, including noise reduction and dereverberation, in order to obtain clearer voice data and a better synthesis result.
4. An acoustic feature extraction module;
the module extracts acoustic features of the processed voice data.
5. A phoneme feature extraction module;
the module inputs text data corresponding to the voice and duration alignment information of the phoneme output by the automatic alignment module, and outputs phoneme characteristics containing phoneme duration information.
6. A training data generation module;
the module processes the acoustic features and the phoneme features to generate a data format required by the training model.
7. A base model module;
a base model trained using existing large-scale data is obtained.
8. A training model module;
and training the voice synthesis model of the speaker by using the data generated by the training data generation module. Specifically, a model pre-trained according to big data is used for initializing the model weight, and then new small speaker data is used for carrying out iterative training on the model. The training model uses a GRU model, which is a variant of a recurrent neural network and can learn the long-term dependence of a time sequence. For speech synthesis, the model can well memorize the context information of the text, and in addition, the GRU model has simple structure and high decoding speed, and can meet the real-time requirement of on-line speech synthesis.
9. An automatic speech synthesis module;
and (3) training the obtained personalized speech synthesis system through a self-adaptive algorithm.
10. An input text module and an output audio module;
used respectively for inputting the text of the voice to be synthesized and for outputting the speaker voice automatically synthesized by the system.
The disclosed embodiments can train a speech synthesis system using only a small amount of speaker data, greatly reducing the cost of training the model; compared with a model trained only on the small amount of speaker voice data, the model trained according to the present disclosure is greatly improved in timbre, intelligibility, accuracy, and other respects.
Referring to fig. 4, an embodiment of the present disclosure provides an adaptive speech synthesis apparatus, including:
a basic voice data obtaining unit 410, configured to obtain basic voice data and text data corresponding to the basic voice data;
a basic speech model training unit 420, configured to train a basic speech model according to the basic speech data and text data corresponding to the basic speech data;
a speaker voice data acquiring unit 430, configured to acquire voice data of a speaker and text data corresponding to the voice data of the speaker;
a GRU model training unit 440, configured to train a GRU speech model according to the speech data of the speaker and the text data corresponding to the speech data of the speaker, and the basic speech model;
the adaptive speech synthesis unit 450 is configured to synthesize the speech of the speaker according to the GRU speech model and the text information included in the command when receiving the speech synthesis command.
Optionally, the GRU model training unit 440 is specifically configured to determine an acoustic feature and a phoneme feature of the speaker according to the voice data of the speaker and text data corresponding to the voice data of the speaker;
the GRU speech model is trained based on the acoustic and phoneme characteristics of the speaker and the underlying speech model.
Optionally, when the GRU model training unit 440 is used to determine the phoneme characteristics of the speaker, it is specifically configured to:
processing the voice data of the speaker and the text data corresponding to the voice data of the speaker according to a preset voice recognition model to obtain duration alignment information of phonemes of the speaker;
and determining the phoneme characteristics of the speaker according to the duration alignment information of the phoneme of the speaker and the text data corresponding to the voice data of the speaker.
Optionally, the GRU model training unit 440 is further configured to:
the speech data of the speaker is preprocessed before determining the acoustic and phoneme characteristics of the speaker.
Optionally, the pre-processing comprises:
noise reduction and/or dereverberation.
Optionally, the GRU model training unit 440 is further configured to:
and verifying that the voice data of the speaker corresponds to the text data corresponding to the voice data of the speaker one to one.
Optionally, the GRU model training unit 440 is configured to, when training the GRU speech model according to the acoustic features and phoneme features of the speaker and the basic speech model, specifically:
initializing a GRU voice model according to a basic voice model;
and processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model.
For specific limitations of the adaptive speech synthesis apparatus, reference may be made to the above limitations of the adaptive speech synthesis method, which are not described herein again.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
It should be appreciated that, in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of carrying out the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of implementing the disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative, but not limiting, of its scope, which is set forth in the following claims.

Claims (8)

1. An adaptive speech synthesis method, comprising:
acquiring basic voice data and text data corresponding to the basic voice data;
training a basic voice model according to the basic voice data and text data corresponding to the basic voice data;
acquiring voice data of a speaker and text data corresponding to the voice data of the speaker;
determining acoustic characteristics and phoneme characteristics of the speaker according to the voice data of the speaker and text data corresponding to the voice data of the speaker;
initializing a GRU voice model according to the basic voice model;
processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model;
and when a voice synthesis instruction is received, synthesizing the voice of the speaker according to the GRU voice model and the text information contained in the instruction.
2. The method of claim 1, wherein determining the phoneme characteristics of the speaker comprises:
processing the voice data of the speaker and the text data corresponding to the voice data of the speaker according to a preset voice recognition model to obtain duration alignment information of phonemes of the speaker;
and determining the phoneme characteristics of the speaker according to the duration alignment information of the phonemes of the speaker and the text data corresponding to the voice data of the speaker.
3. The method of claim 2, wherein before processing the speech data of the speaker and the text data corresponding to the speech data of the speaker according to a preset speech recognition model, further comprising:
verifying that the voice data of the speaker corresponds to the text data one to one.
4. The method of claim 1, wherein determining the acoustic and phoneme characteristics of the speaker further comprises:
preprocessing the voice data of the speaker.
5. The method of claim 4, wherein the pre-processing comprises:
noise reduction and/or dereverberation.
6. An adaptive speech synthesis apparatus, comprising:
the basic voice data acquisition unit is used for acquiring basic voice data and text data corresponding to the basic voice data;
the basic voice model training unit is used for training a basic voice model according to the basic voice data and the text data corresponding to the basic voice data;
the speaker voice data acquisition unit is used for acquiring voice data of a speaker and text data corresponding to the voice data of the speaker;
the GRU model training unit is used for determining the acoustic characteristics and the phoneme characteristics of the speaker according to the voice data of the speaker and the text data corresponding to the voice data of the speaker; initializing a GRU voice model according to the basic voice model; processing the phoneme characteristics and the acoustic characteristics into a format required by the GRU voice model, and inputting the phoneme characteristics and the acoustic characteristics into the GRU voice model to finish the training of the GRU voice model;
and the adaptive speech synthesis unit is used for synthesizing the voice of the speaker according to the GRU voice model and the text information contained in the instruction when a voice synthesis instruction is received.
7. A readable storage medium having executable instructions stored thereon that, when executed, cause a computer to perform the operations included in any one of claims 1-5.
8. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in any one of claims 1-5.
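Claims 1 and 6 hinge on two technical steps: warm-starting a GRU voice model from the pre-trained basic voice model, then driving the GRU recurrence with speaker-specific phoneme and acoustic features. The following is a minimal illustrative NumPy sketch of that idea only, not the patented implementation: the class and method names (`GRUCell`, `init_from`) are hypothetical, the inputs are random vectors standing in for real per-frame phoneme features, and a real system would fine-tune the copied weights on the speaker's data rather than leaving them fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU recurrence: update gate z, reset gate r, candidate state."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        shape = (input_dim + hidden_dim, hidden_dim)
        self.Wz = rng.uniform(-scale, scale, shape)
        self.Wr = rng.uniform(-scale, scale, shape)
        self.Wh = rng.uniform(-scale, scale, shape)
        self.hidden_dim = hidden_dim

    def init_from(self, base):
        """Warm-start from a pre-trained base model, mirroring claim 1's
        'initializing a GRU voice model according to the basic voice model'."""
        self.Wz, self.Wr, self.Wh = base.Wz.copy(), base.Wr.copy(), base.Wh.copy()

    def forward(self, frames):
        h = np.zeros(self.hidden_dim)
        for x in frames:  # frames: sequence of per-frame feature vectors
            xh = np.concatenate([x, h])
            z = sigmoid(xh @ self.Wz)                               # update gate
            r = sigmoid(xh @ self.Wr)                               # reset gate
            h_cand = np.tanh(np.concatenate([x, r * h]) @ self.Wh)  # candidate
            h = (1 - z) * h + z * h_cand
        return h

base = GRUCell(input_dim=4, hidden_dim=8, seed=0)
adapted = GRUCell(input_dim=4, hidden_dim=8, seed=1)
adapted.init_from(base)  # warm-start before fine-tuning on the speaker's data

# Random stand-ins for processed phoneme/acoustic feature frames.
frames = np.random.default_rng(2).normal(size=(10, 4))
print(np.allclose(adapted.forward(frames), base.forward(frames)))  # True
```

Before any speaker-specific fine-tuning the adapted model is an exact copy of the base model, which is the point of the initialization step: training then only has to shift these weights toward the target speaker's voice rather than learn from scratch.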
CN201910661648.6A 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment Active CN110379407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910661648.6A CN110379407B (en) 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment


Publications (2)

Publication Number Publication Date
CN110379407A CN110379407A (en) 2019-10-25
CN110379407B true CN110379407B (en) 2021-10-19

Family

ID=68254684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910661648.6A Active CN110379407B (en) 2019-07-22 2019-07-22 Adaptive speech synthesis method, device, readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110379407B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111276120B (en) * 2020-01-21 2022-08-19 华为技术有限公司 Speech synthesis method, apparatus and computer-readable storage medium
CN112133282B (en) * 2020-10-26 2022-07-08 厦门大学 Lightweight multi-speaker speech synthesis system and electronic equipment
CN112185340B (en) * 2020-10-30 2024-03-15 网易(杭州)网络有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
WO2022094740A1 (en) * 2020-11-03 2022-05-12 Microsoft Technology Licensing, Llc Controlled training and use of text-to-speech models and personalized model generated voices
CN112365876B (en) * 2020-11-27 2022-04-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium
CN112634856B (en) * 2020-12-10 2022-09-02 思必驰科技股份有限公司 Speech synthesis model training method and speech synthesis method
CN113314092A (en) * 2021-05-11 2021-08-27 北京三快在线科技有限公司 Method and device for model training and voice interaction
CN113299268A (en) * 2021-07-28 2021-08-24 成都启英泰伦科技有限公司 Speech synthesis method based on stream generation model
CN113707122B (en) * 2021-08-11 2024-04-05 北京搜狗科技发展有限公司 Method and device for constructing voice synthesis model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US9922641B1 (en) * 2012-10-01 2018-03-20 Google Llc Cross-lingual speaker adaptation for multi-lingual speech synthesis
CN108550363A (en) * 2018-06-04 2018-09-18 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device, computer equipment and readable medium
CN108665898A (en) * 2018-03-30 2018-10-16 西北师范大学 A method of gesture is converted into the Chinese and hides double-language voice
CN108831435A (en) * 2018-06-06 2018-11-16 安徽继远软件有限公司 A kind of emotional speech synthesizing method based on susceptible sense speaker adaptation
CN109599092A (en) * 2018-12-21 2019-04-09 秒针信息技术有限公司 A kind of audio synthetic method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100281584B1 (en) * 1998-12-24 2001-02-15 이계철 Program and data initialization method in voice recognition system
JP6523893B2 (en) * 2015-09-16 2019-06-05 株式会社東芝 Learning apparatus, speech synthesis apparatus, learning method, speech synthesis method, learning program and speech synthesis program
CN107871497A (en) * 2016-09-23 2018-04-03 北京眼神科技有限公司 Audio recognition method and device
JP6553584B2 (en) * 2016-12-12 2019-07-31 日本電信電話株式会社 Basic frequency model parameter estimation apparatus, method, and program
JP7013172B2 (en) * 2017-08-29 2022-01-31 株式会社東芝 Speech synthesis dictionary distribution device, speech synthesis distribution system and program
CN109523993B (en) * 2018-11-02 2022-02-08 深圳市网联安瑞网络科技有限公司 Voice language classification method based on CNN and GRU fusion deep neural network
CN109635462A (en) * 2018-12-17 2019-04-16 深圳前海微众银行股份有限公司 Model parameter training method, device, equipment and medium based on federation's study
CN109885832A (en) * 2019-02-14 2019-06-14 平安科技(深圳)有限公司 Model training, sentence processing method, device, computer equipment and storage medium


Also Published As

Publication number Publication date
CN110379407A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN108520741B (en) Method, device and equipment for restoring ear voice and readable storage medium
JP6837298B2 (en) Devices and methods for calculating acoustic scores, devices and methods for recognizing voice, and electronic devices
US7124081B1 (en) Method and apparatus for speech recognition using latent semantic adaptation
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN110838289A (en) Awakening word detection method, device, equipment and medium based on artificial intelligence
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN107240401B (en) Tone conversion method and computing device
CN110379415B (en) Training method of domain adaptive acoustic model
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
GB2557714A (en) Determining phonetic relationships
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN111402861A (en) Voice recognition method, device, equipment and storage medium
CN110379411B (en) Speech synthesis method and device for target speaker
CN108564944B (en) Intelligent control method, system, equipment and storage medium
JP6875819B2 (en) Acoustic model input data normalization device and method, and voice recognition device
CN112489623A (en) Language identification model training method, language identification method and related equipment
US11295732B2 (en) Dynamic interpolation for hybrid language models
CN113449089B (en) Intent recognition method, question-answering method and computing device of query statement
CN114495905A (en) Speech recognition method, apparatus and storage medium
CN113327575A (en) Speech synthesis method, device, computer equipment and storage medium
CN113421571B (en) Voice conversion method and device, electronic equipment and storage medium
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN113327578B (en) Acoustic model training method and device, terminal equipment and storage medium
CN110728137B (en) Method and device for word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant