CN111681639B - Multi-speaker voice synthesis method, device and computing equipment


Publication number
CN111681639B
CN111681639B
Authority
CN
China
Prior art keywords
data, speaker, voice, frequency, speaker voice
Prior art date
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN202010471223.1A
Other languages
Chinese (zh)
Other versions
CN111681639A (en)
Inventor
殷昊
陈云琳
江明奇
雷欣
Current Assignee (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Shanghai Mobvoi Information Technology Co ltd
Original Assignee
Shanghai Mobvoi Information Technology Co ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Shanghai Mobvoi Information Technology Co ltd
Priority to CN202010471223.1A
Publication of CN111681639A
Application granted
Publication of CN111681639B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Abstract

The disclosure provides a multi-speaker speech synthesis method, apparatus, readable storage medium, and computing device, which address the degradation of speech synthesis quality caused by unbalanced amounts of speech data across sound types in multi-speaker speech synthesis. The method comprises the following steps: acquiring multi-speaker voice data comprising at least two sound types; performing data enhancement processing on the multi-speaker voice data; inputting the multi-speaker voice data into a multi-speaker speech synthesis system for data training; and, after the multi-speaker speech synthesis system is trained, inputting an instruction containing a specified speaker and a specified text to the system, instructing it to synthesize speech.

Description

Multi-speaker voice synthesis method, device and computing equipment
Technical Field
The present disclosure relates to the field of speech synthesis technology, and in particular to a multi-speaker speech synthesis method and apparatus, a readable storage medium, and a computing device.
Background
Speech synthesis, or Text-To-Speech (TTS), is a technique by which a computer automatically generates the corresponding speech from text. Current speech synthesis systems require large amounts of high-quality data (recorded with specialized recording equipment) and can synthesize only one timbre. Building a speech synthesis system with several different speaker timbres therefore consumes a great deal of financial and material resources.
The current mainstream optimization is multi-speaker speech synthesis (multi-speaker TTS), which can synthesize sounds with different timbres from a single system. Specifically, during training the multi-speaker speech synthesis system distinguishes different speakers' voices by a speaker ID; at synthesis time, the voices of different speakers are generated by supplying different speaker IDs. Compared with a traditional single-speaker speech synthesis system, this technique pools the voice data of multiple speakers: on one hand, the larger data volume makes model training more thorough; on the other hand, commonalities across different voices can be extracted, making the model more robust.
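To make the speaker-ID mechanism concrete, the following is a minimal sketch, assuming a typical encoder-decoder multi-speaker TTS in which a learned per-speaker embedding is combined with the text encoding; the module, dimensions, and layout are illustrative assumptions, not the architecture fixed by this disclosure.

```python
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Condition text-encoder outputs on a speaker ID via a learned embedding."""
    def __init__(self, num_speakers: int, text_dim: int = 256, spk_dim: int = 64):
        super().__init__()
        self.spk_table = nn.Embedding(num_speakers, spk_dim)  # one vector per speaker ID
        self.proj = nn.Linear(text_dim + spk_dim, text_dim)

    def forward(self, text_enc: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # text_enc: (batch, time, text_dim); speaker_id: (batch,)
        spk = self.spk_table(speaker_id)                         # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, text_enc.size(1), -1)  # broadcast over time
        return self.proj(torch.cat([text_enc, spk], dim=-1))

# Changing only speaker_id changes the timbre of the synthesized voice.
cond = SpeakerConditioner(num_speakers=10)
out = cond(torch.randn(2, 50, 256), torch.tensor([3, 7]))       # -> (2, 50, 256)
```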
When training a multi-speaker speech synthesis system, voice data of speakers with different timbres must be prepared as training data. In general, to keep the different sound types balanced, a comparable number of male/female and adult/child voices is prepared. In practice, however, some type of sound data is often available only in very small quantities, and if one timbre is very rare in the training data relative to the others, the resulting synthesized voice will resemble the real voice less closely. The conventional remedy is to keep adding that type of sound data, which does improve the data imbalance but greatly increases cost, since recording the sound data needed to train a speech synthesis system is very expensive.
Disclosure of Invention
To this end, the present disclosure provides a multi-speaker speech synthesis method, apparatus, readable storage medium, and computing device in an effort to solve or at least alleviate at least one of the problems presented above.
According to an aspect of the embodiments of the present disclosure, there is provided a multi-speaker speech synthesis method, including:
acquiring multi-speaker voice data comprising at least two voice types;
performing data enhancement processing on the multi-speaker voice data;
inputting the multi-speaker voice data into a multi-speaker voice synthesis system for data training;
after training the multi-speaker speech synthesis system, inputting instructions containing the specified speaker and the specified text to the multi-speaker speech synthesis system to instruct the multi-speaker speech synthesis system to synthesize speech.
Optionally, the data enhancement processing is performed on the multi-speaker voice data, including:
determining, according to the data amount of the speaker voice data of each sound type, a specific sound type requiring data enhancement processing and the number of data enhancement processing times;
converting speaker voice data of a specific sound type into frequency domain data;
setting the energy values of one or more designated frequency intervals of the frequency domain data to zero in sequence according to the data enhancement processing times, and respectively generating a plurality of new speaker voice data of the specific sound type; wherein the specified frequency interval is divided in advance in a frequency range of the frequency domain data.
Optionally, determining the number of data enhancement processing times includes:
determining the number of data enhancement processing times based on the ratio of the data amount of the speaker voice data of the specific sound type to the data amount of the speaker voice data of other sound types.
Optionally, dividing the specified frequency interval in the frequency range of the frequency domain data includes:
determining the number of the designated frequency intervals according to the data enhancement processing times;
dividing the specified frequency intervals in the frequency range of the frequency domain data according to the number of the specified frequency intervals.
Optionally, the specified frequency interval is uniformly divided within the frequency range of the frequency domain data; or,
the specified frequency interval is non-uniformly divided within the frequency range of the frequency domain data according to human voice characteristics; or,
a machine learning approach is used to determine the division of the specified frequency interval that yields the best multi-speaker speech synthesis result.
Optionally, a Fast Fourier Transform (FFT) method is used when converting the speaker voice data of the specific sound type into frequency domain data.
Optionally, acquiring multi-speaker speech data including at least two sound types includes:
multi-speaker speech data including at least two sound types of adult male, adult female, male and female boy sounds is acquired.
According to still another aspect of the embodiments of the present disclosure, there is provided a multi-speaker speech synthesis apparatus including:
the data acquisition unit is used for acquiring multi-speaker voice data containing at least two sound types;
the data enhancement unit is used for performing data enhancement processing on the multi-speaker voice data;
the training unit is used for inputting the multi-speaker voice data into the multi-speaker voice synthesis system to perform data training;
and the voice synthesis unit is used for inputting an instruction containing a designated speaker and a designated text to the multi-speaker voice synthesis system after the multi-speaker voice synthesis system is trained, and instructing the multi-speaker voice synthesis system to synthesize voice.
According to yet another aspect of embodiments of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the above-described multi-speaker speech synthesis method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the multi-speaker speech synthesis method described above.
According to the embodiments of the present disclosure, multi-speaker speech data including at least two sound types is acquired; data enhancement processing is performed on the multi-speaker voice data; the multi-speaker voice data is input into a multi-speaker speech synthesis system for data training; and after the system is trained, an instruction containing a designated speaker and a designated text is input to the system, instructing it to synthesize speech. The data enhancement method solves the data imbalance problem in the multi-speaker speech synthesis system without adding extra cost. In another embodiment of the present disclosure, a novel data enhancement method is provided, which expands the data by masking the energy values on a certain frequency segment; it has been verified that this data enhancement method trains a better multi-speaker speech synthesis system.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flow chart of a method of multi-speaker speech synthesis according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a data enhancement process in a multi-speaker speech synthesis method according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of multi-speaker speech synthesis in accordance with a specific embodiment of the present disclosure;
fig. 5 is a block diagram of a multi-speaker speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 is a block diagram of an example computing device 100 that implements a multi-speaker speech synthesis method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes a system memory 106 and one or more processors 104. The memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a first-level cache 110 and a second-level cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory including, but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. The system memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the program 122 may be configured to execute instructions on an operating system by the one or more processors 104 using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to basic configuration 102 via bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 via one or more communication ports 164 over a network communication link.
The network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, and may include any information delivery media in a modulated data signal, such as a carrier wave or other transport mechanism. A "modulated data signal" may be a signal that has one or more of its data set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or special purpose network, and wireless media such as acoustic, radio Frequency (RF), microwave, infrared (IR) or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-sized portable (or mobile) electronic device, which may be a Personal Digital Assistant (PDA), a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that may include any of the above functions. Computing device 100 may also be implemented as a personal computer including desktop and notebook computer configurations.
Wherein the one or more programs 122 of the computing device 100 include instructions for performing a multi-speaker speech synthesis method according to the present disclosure.
Traditional speech synthesis trains a separate model on each speaker's data, eventually producing multiple distinct speech synthesis systems. Such single-speaker systems have many drawbacks, including high cost, poor robustness, and model redundancy. Multi-speaker speech synthesis technology can solve these problems, but when the data is imbalanced, synthesis quality degrades for the voices with less data. The multi-speaker speech synthesis method provided by this disclosure effectively alleviates the data imbalance problem without adding extra cost.
Fig. 2 schematically illustrates a flow chart of a multi-speaker speech synthesis method 200 according to one embodiment of the present disclosure, the multi-speaker speech synthesis method 200 beginning at step S210.
In step S210, multi-speaker speech data including at least two sound types is acquired.
Depending on timbre, sound types can be divided into male voice and female voice; or into adult voices (including adult male voice and adult female voice) and child voices (including boy's voice and girl's voice), and so on. The acquired multi-speaker voice data of each sound type includes the speaker voice data of one or more persons. The amounts of speaker voice data of the various sound types are unbalanced; for example, one sound type may have little data and be difficult to obtain. If such multi-speaker voice data is applied directly to train a multi-speaker speech synthesis system, the synthesis result is not ideal, and the speech synthesized for speakers with little data is of poor quality.
Subsequently, in step S220, data enhancement processing is performed on the multi-speaker voice data.
Data enhancement refers to expanding the data by some method in scenarios where training data is insufficient. As shown in fig. 3, one embodiment of the present disclosure provides a data enhancement processing method for multi-speaker voice data, including:
s310, determining a specific sound type and the number of data enhancement processing times which need the data enhancement processing according to the data quantity of the speaker voice data of each sound type;
specifically, in step S310, determining the number of data enhancement processes includes: the number of data enhancement processes is determined based on the ratio of the amount of data of the speaker's voice data of the particular voice type to the amount of data of the speaker's voice data of the other voice type. Wherein the data amount of the voice data of each speaker is the same or different-for example, the voice data of the speaker of the male voice has 1 voice data of the speaker, 20 frames of voice are taken as the total, the voice data of the speaker of the female voice comprises 10 voice data of the speaker, and each data of the speaker comprises 20 frames of voice, then the data enhancement processing is required to be carried out on the voice data of the male voice, and the data enhancement processing times are 10/1-1=9 times; for another example, the speaker voice data of male voice has voice data of 1 person, 20 frames of voice in total, the speaker voice data of female voice includes voice data of 10 persons, each person's data contains 10 to 30 frames of voice, 200 frames of voice in total, and the number of data enhancement processing is 200/20 to 1=9 times.
In an alternative embodiment, "other sound types" refers specifically to the sound type whose speaker voice data has the largest data amount, and determining the number of data enhancement times includes: determining the number of data enhancement times according to the ratio of that largest data amount to the data amount of the speaker voice data of the specific sound type. For example, suppose the male-voice data consists of 2 speakers' voice data with 20 frames per speaker (40 frames in total), the boy's-voice data consists of 1 speaker's voice data totaling 20 frames, and the female-voice data consists of 8 speakers' voice data with 20 frames per speaker (160 frames in total). Then the male-voice data requires data enhancement processing 160/40 - 1 = 3 times, and the boy's-voice data requires data enhancement processing 160/20 - 1 = 7 times.
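As a worked illustration of the count rule in the two examples above, here is a minimal sketch, assuming total frame counts are used as the measure of data amount; the type names and numbers simply mirror the second example.

```python
def augmentation_counts(frames_per_type: dict) -> dict:
    """Passes per sound type: (largest data amount / own data amount) - 1."""
    largest = max(frames_per_type.values())
    return {t: largest // f - 1 for t, f in frames_per_type.items()}

# Example above: male 2 x 20 = 40 frames, boy 1 x 20 = 20, female 8 x 20 = 160.
print(augmentation_counts({"male": 40, "boy": 20, "female": 160}))
# -> {'male': 3, 'boy': 7, 'female': 0}
```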
After the data enhancement processing, the data amounts of the speaker voice data of the different sound types are the same, close to the same, or at least reach the minimum ratio required to meet the speech synthesis quality requirement, thereby alleviating the data imbalance problem. However, maintaining a certain degree of variation among training data is key to improving model quality, so the present disclosure proposes a new data enhancement approach in steps S320 and S330.
Specifically, in step S320, the speaker voice data of the specific sound type is converted into frequency domain data; for example, the conversion from the time domain to the frequency domain can be performed with the Fast Fourier Transform (FFT).
Subsequently, in step S330, according to the number of data enhancement processes, energy values of one or more designated frequency intervals of the frequency domain data are set to zero in sequence, so as to generate a plurality of new speaker voice data of a specific sound type respectively; wherein the specified frequency interval is divided in advance in a frequency range of the frequency domain data.
In an alternative embodiment, the number of designated frequency intervals is determined in advance according to the number of data enhancement processing times, and the designated frequency intervals are divided within the frequency range of the frequency domain data according to that number. For example, if the number of data enhancement processing times is determined to be 8, the frequency range is divided into 8 intervals.
Further, dividing the designated frequency intervals within the frequency range of the frequency domain data includes uniformly dividing them. For example, suppose the frequency range must be divided into 8 segments and the frequency range of the frequency domain data after the FFT is 1 to 512: 512 is divided evenly into 8 frequency segments, each of length 64. The same voice data is then copied 8 times; in the first copy the energy values of frequency bins 1 to 64 of the FFT spectrum are set to 0, in the second copy the energy values of bins 64 to 128 are set to 0, and so on, until in the eighth copy the energy values of bins 448 to 512 are set to 0. The 8 newly expanded data sets are added to the original training data as new speaker data for model training. The originally scarce sound type (such as male voice) is thereby expanded, and the data imbalance problem is alleviated, as sketched below.
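This is a single-frame sketch of the uniform-division example under stated assumptions: a 1024-sample analysis frame (so the FFT yields roughly 512 usable bins, matching the example), with framing and overlap-add left out because the text does not specify them.

```python
import numpy as np

def augment_by_band_masking(frame: np.ndarray, n_bands: int = 8) -> list:
    """Create n_bands variants of one frame, each with one band's energy zeroed."""
    spec = np.fft.rfft(frame)                    # 1024 samples -> 513 frequency bins
    band = spec.shape[0] // n_bands              # uniform band width (64 here)
    variants = []
    for k in range(n_bands):
        masked = spec.copy()
        masked[k * band:(k + 1) * band] = 0.0    # zero band k: bins 1-64, 64-128, ...
        variants.append(np.fft.irfft(masked, n=len(frame)))  # back to the time domain
    return variants

frame = np.random.randn(1024)                    # stand-in for one speech frame
new_variants = augment_by_band_masking(frame)    # 8 variants that each sound different
```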
Alternatively, the designated frequency intervals are divided non-uniformly within the frequency range of the frequency domain data according to human voice characteristics, so that approximately the same amount of information is lost when the energy values of different designated frequency intervals are set to zero. For example, since the information of human speech is unevenly distributed across frequency bands, and the low frequencies of everyday speech carry a smaller amount of information, the low-frequency intervals can be made wider than the mid- and high-frequency intervals.
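One possible non-uniform split under this heuristic is sketched below; the square-root mapping that widens the low-frequency intervals is purely an illustrative assumption, since the disclosure does not fix a division scheme.

```python
import numpy as np

def nonuniform_band_edges(n_bins: int = 512, n_bands: int = 8) -> list:
    """Band edges that are wider at low frequencies and narrower higher up."""
    edges = np.linspace(0.0, 1.0, n_bands + 1) ** 0.5   # concave mapping
    return [int(round(e * n_bins)) for e in edges]

print(nonuniform_band_edges())
# -> [0, 181, 256, 314, 362, 405, 443, 479, 512]; the first band alone spans 181 bins
```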
Furthermore, machine learning can be used to find the division of the designated frequency intervals that yields the best multi-speaker speech synthesis result.
In the above embodiments, the respective designated frequency bins may or may not overlap with each other.
Subsequently, in step S230, the multi-speaker voice data, including the data generated by the enhancement processing, is input into the multi-speaker speech synthesis system for data training. Then, in step S240, after training of the multi-speaker speech synthesis system is completed, an instruction containing a specified speaker and a specified text is input to the multi-speaker speech synthesis system, instructing it to synthesize speech.
Specific embodiments of the present disclosure are given below.
As shown in fig. 4, the present embodiment includes the following steps:
step 1, data enhancement is carried out on a class of sound with less data, firstly, m parts of the data are expanded, and then, the energy value of each part of the data on different frequencies is set to be 0. By discarding energy values at a certain frequency, the sound sounds different from the original sound, corresponding to the addition of a new speaker.
Step 2: fuse the m data sets obtained through data enhancement with the original data to train the multi-speaker speech synthesis system (see the sketch after step 3).
Step 3: after model training is finished, input a text to be synthesized into the model together with the speaker ID (speaker id) of the sound type to be synthesized; the voice corresponding to that speaker ID is then synthesized.
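A minimal sketch of steps 2 and 3, under an assumed data layout (the disclosure does not fix any API): each augmented copy re-enters the training set under a fresh speaker ID, and synthesis then selects a voice by ID.

```python
def fuse_with_new_speaker_ids(original, augmented_copies, next_free_id: int):
    """original: list of (waveform, speaker_id, text) tuples.
    augmented_copies: m lists of waveforms aligned one-to-one with `original`."""
    fused = list(original)
    for m, copy_m in enumerate(augmented_copies):
        new_id = next_free_id + m                # each copy becomes a "new speaker"
        fused += [(wav, new_id, text) for wav, (_, _, text) in zip(copy_m, original)]
    return fused

# Step 3, illustrative only: after training on `fused`, synthesis would take a
# text plus a speaker id, e.g. audio = model.infer(text, speaker_id=3) in some
# hypothetical interface.
```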
Referring to fig. 5, a multi-speaker speech synthesis apparatus provided in an embodiment of the present disclosure includes:
a data acquisition unit 510 for acquiring multi-speaker voice data including at least two sound types;
a data enhancement unit 520 for performing data enhancement processing on the multi-speaker voice data;
a training unit 530 for inputting the multi-speaker speech data into the multi-speaker speech synthesis system for data training;
the speech synthesis unit 540 is configured to input an instruction including a specified speaker and a specified text to the multi-speaker speech synthesis system after training is completed for the multi-speaker speech synthesis system, and instruct the multi-speaker speech synthesis system to synthesize speech.
Optionally, the data enhancement unit 520 is specifically configured to:
determine, according to the data amount of the speaker voice data of each sound type, a specific sound type requiring data enhancement processing and the number of data enhancement processing times;
converting speaker voice data of a specific sound type into frequency domain data;
setting the energy values of one or more designated frequency intervals of the frequency domain data to zero in sequence according to the data enhancement processing times, and respectively generating a plurality of new speaker voice data of a specific sound type; wherein the specified frequency interval is divided in advance in a frequency range of the frequency domain data.
Optionally, when determining the number of data enhancement processing times, the data enhancement unit 520 is specifically configured to:
determine the number of data enhancement processing times based on the ratio of the data amount of the speaker voice data of the specific sound type to the data amount of the speaker voice data of other sound types.
Optionally, the data enhancement unit 520 is further configured to:
determining the number of the designated frequency intervals according to the data enhancement processing times;
the designated frequency bins are divided in the frequency range of the frequency domain data according to the number of the designated frequency bins.
Optionally, when dividing the designated frequency interval within the frequency range of the frequency domain data, the data enhancement unit 520 is specifically configured to:
uniformly divide the designated frequency interval within the frequency range of the frequency domain data; or,
non-uniformly divide the designated frequency interval within the frequency range of the frequency domain data according to human voice characteristics; or,
determine, by means of machine learning, the division of the designated frequency interval that yields the best multi-speaker speech synthesis result.
Optionally, the data enhancement unit 520 uses a Fast Fourier Transform (FFT) method when converting the speaker voice data of the specific sound type into frequency domain data.
Optionally, the data acquisition unit 510 is specifically configured to:
multi-speaker speech data including at least two sound types of adult male, adult female, male and female boy sounds is acquired.
According to the technical solution above, the data imbalance problem in the multi-speaker speech synthesis system is solved through a data enhancement method without adding extra cost. A novel data enhancement method is also provided, which achieves the data expansion by masking the energy values on a certain frequency segment. Comparing the multi-speaker speech synthesis systems trained with and without the data enhancement method shows that the system with data enhancement synthesizes better-quality sound for the timbre classes with less data, and its user Mean Opinion Score (MOS) is also higher than that of the system without the data enhancement method.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions of the methods and apparatus of the present disclosure, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules, or other data. Communication media typically embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be construed as reflecting the intention that: i.e., the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment, or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component and, furthermore, they may be divided into a plurality of sub-modules or sub-units or sub-components. Any combination of all features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or units of any method or apparatus so disclosed, may be used in combination, except insofar as at least some of such features and/or processes or units are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features but not others included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as methods or combinations of method elements that may be implemented by a processor of a computer system or by other means of performing the functions. Thus, a processor with the necessary instructions for implementing the described method or method element forms a means for implementing the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is for performing functions performed by elements for purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal terms "first," "second," "third," etc., to describe a general object merely denote different instances of like objects, and are not intended to imply that the objects so described must have a given order, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of the above disclosure, will appreciate that other embodiments are contemplated within the scope of the disclosure as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present disclosure is illustrative, but not limiting, of the scope of the disclosure, which is defined by the appended claims.

Claims (8)

1. A method of multi-speaker speech synthesis, comprising:
acquiring multi-speaker voice data comprising at least two voice types;
performing data enhancement processing on the multi-speaker voice data;
inputting the multi-speaker voice data into a multi-speaker voice synthesis system for data training;
after training the multi-speaker speech synthesis system, inputting an instruction containing a designated speaker and a designated text to the multi-speaker speech synthesis system, and instructing the multi-speaker speech synthesis system to synthesize speech;
performing data enhancement processing on the multi-speaker voice data, including:
determining, according to the data amount of the speaker voice data of each sound type, a specific sound type requiring data enhancement processing and the number of data enhancement processing times;
converting the speaker voice data of the specific sound type into frequency domain data;
setting the energy values of one or more designated frequency intervals of the frequency domain data to zero in sequence according to the number of data enhancement processing times, and respectively generating a plurality of new speaker voice data of the specific sound type; wherein the designated frequency interval is divided in advance within the frequency range of the frequency domain data, and dividing the designated frequency interval in advance within the frequency range of the frequency domain data includes: determining the number of designated frequency intervals in advance according to the number of data enhancement processing times; and dividing the designated frequency intervals within the frequency range of the frequency domain data according to the number of designated frequency intervals.
2. The method of claim 1, wherein determining the number of data enhancement processing times comprises:
determining the number of data enhancement processing times according to the ratio of the data amount of the speaker voice data of the specific sound type to the data amount of the speaker voice data of other sound types.
3. The method of claim 1, wherein dividing the designated frequency interval within the frequency range of the frequency domain data comprises:
uniformly dividing the designated frequency interval within the frequency range of the frequency domain data; or,
non-uniformly dividing the designated frequency interval within the frequency range of the frequency domain data according to human voice characteristics; or,
determining, by means of machine learning, the division of the designated frequency interval that yields the best multi-speaker speech synthesis result.
4. The method of claim 1, wherein the speaker voice data of the specific sound type is converted into frequency domain data using a Fast Fourier Transform (FFT) method.
5. The method of claim 1, wherein obtaining multi-speaker speech data comprising at least two sound types comprises:
multi-speaker speech data including at least two sound types of adult male, adult female, male and female boy sounds is acquired.
6. A multi-speaker speech synthesis apparatus, comprising:
the data acquisition unit is used for acquiring multi-speaker voice data containing at least two sound types;
a data enhancement unit, configured to determine, according to the data amount of the speaker voice data of each sound type, a specific sound type requiring data enhancement processing and the number of data enhancement processing times; convert the speaker voice data of the specific sound type into frequency domain data; and set the energy values of one or more designated frequency intervals of the frequency domain data to zero in sequence according to the number of data enhancement processing times, respectively generating a plurality of new speaker voice data of the specific sound type; wherein the designated frequency interval is divided in advance within the frequency range of the frequency domain data, and dividing the designated frequency interval in advance within the frequency range of the frequency domain data includes: determining the number of designated frequency intervals in advance according to the number of data enhancement processing times; and dividing the designated frequency intervals within the frequency range of the frequency domain data according to the number of designated frequency intervals;
the training unit is used for inputting the multi-speaker voice data into a multi-speaker voice synthesis system to perform data training;
and the voice synthesis unit is used for inputting an instruction containing a specified speaker and a specified text to the multi-speaker voice synthesis system after the multi-speaker voice synthesis system is trained, and instructing the multi-speaker voice synthesis system to synthesize voice.
7. A readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the method of any of claims 1-5.
8. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method of any of claims 1-5.
CN202010471223.1A 2020-05-28 2020-05-28 Multi-speaker voice synthesis method, device and computing equipment Active CN111681639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010471223.1A CN111681639B (en) 2020-05-28 2020-05-28 Multi-speaker voice synthesis method, device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010471223.1A CN111681639B (en) 2020-05-28 2020-05-28 Multi-speaker voice synthesis method, device and computing equipment

Publications (2)

Publication Number Publication Date
CN111681639A CN111681639A (en) 2020-09-18
CN111681639B true CN111681639B (en) 2023-05-30

Family

ID=72452869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010471223.1A Active CN111681639B (en) 2020-05-28 2020-05-28 Multi-speaker voice synthesis method, device and computing equipment

Country Status (1)

Country Link
CN (1) CN111681639B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133278B (en) * 2020-11-20 2021-02-05 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN113053353B (en) * 2021-03-10 2022-10-04 度小满科技(北京)有限公司 Training method and device of speech synthesis model
CN112992162B (en) * 2021-04-16 2021-08-20 杭州一知智能科技有限公司 Tone cloning method, system, device and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1140871A (en) * 1995-02-22 1997-01-22 数字语音系统公司 Synthesis of speech using regenerated phase information
JP2002073067A (en) * 2000-09-05 2002-03-12 Victor Co Of Japan Ltd Method for decoding audio signal and decoder for audio signal
CN101369423A (en) * 2007-08-17 2009-02-18 株式会社东芝 Voice synthesizing method and device
JP2012145802A (en) * 2011-01-13 2012-08-02 Fujitsu Ltd Speech synthesizer and speech synthesis program
EP3544005A1 (en) * 2018-03-22 2019-09-25 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoder, audio decoder, audio encoding method and audio decoding method for dithered quantization for frequency-domain speech and audio coding
CN110379414A (en) * 2019-07-22 2019-10-25 出门问问(苏州)信息科技有限公司 Acoustic model enhancement training method and device, readable storage medium and computing equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE3017517A1 (en) * 1979-05-07 1980-11-13 Texas Instruments Inc LANGUAGE SYNTHESIS ARRANGEMENT
US7343289B2 (en) * 2003-06-25 2008-03-11 Microsoft Corp. System and method for audio/video speaker detection
US7991612B2 (en) * 2006-11-09 2011-08-02 Sony Computer Entertainment Inc. Low complexity no delay reconstruction of missing packets for LPC decoder
JP5159325B2 (en) * 2008-01-09 2013-03-06 株式会社東芝 Voice processing apparatus and program thereof
US20120029926A1 (en) * 2010-07-30 2012-02-02 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for dependent-mode coding of audio signals


Also Published As

Publication number Publication date
CN111681639A (en) 2020-09-18

Similar Documents

Publication Publication Date Title
CN111681639B (en) Multi-speaker voice synthesis method, device and computing equipment
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN106373580A (en) Singing synthesis method based on artificial intelligence and device
CN110379415B (en) Training method of domain adaptive acoustic model
WO2020248393A1 (en) Speech synthesis method and system, terminal device, and readable storage medium
US11264006B2 (en) Voice synthesis method, device and apparatus, as well as non-volatile storage medium
CN108206026A (en) Determine the method and device of audio content pitch deviation
US8131550B2 (en) Method, apparatus and computer program product for providing improved voice conversion
CN107240401B (en) Tone conversion method and computing device
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
CN107170464B (en) Voice speed changing method based on music rhythm and computing equipment
CN110379414B (en) Acoustic model enhancement training method and device, readable storage medium and computing equipment
CN106898339B (en) Song chorusing method and terminal
CN111128116B (en) Voice processing method and device, computing equipment and storage medium
JP2018537732A (en) Audio data processing method and apparatus
CN112289300B (en) Audio processing method and device, electronic equipment and computer readable storage medium
CN105321526A (en) Audio processing method and electronic device
CN110211562A (en) Speech synthesis method, electronic device and readable storage medium
CN111883100B (en) Voice conversion method, device and server
CN114842859A (en) Voice conversion method, system, terminal and storage medium based on IN and MI
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
JP2004117662A (en) Voice synthesizing system
JP2020204683A (en) Electronic publication audio-visual system, audio-visual electronic publication creation program, and program for user terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant