CN111477210A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN111477210A
Authority
CN
China
Prior art keywords
features
data
target user
text
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010253084.5A
Other languages
Chinese (zh)
Inventor
殷翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010253084.5A priority Critical patent/CN111477210A/en
Publication of CN111477210A publication Critical patent/CN111477210A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

Embodiments of the present disclosure disclose a speech synthesis method and a speech synthesis apparatus. One embodiment of the method comprises: acquiring a text; determining phonetic features of the text; and inputting the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is trained based on singing data of the target user. This embodiment achieves the effect of simulating the user's speech data based on the user's singing data.

Description

Speech synthesis method and device
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to a speech synthesis method and device.
Background
Speech synthesis, also known as Text-to-Speech (TTS) technology, is a technology for generating artificial speech by mechanical and electronic means. Speech synthesis can convert arbitrary text information into standard, fluent speech. It involves multiple disciplines such as acoustics, linguistics, digital signal processing and computer science, and is currently an important research direction in the field of Chinese information processing.
However, the speech generated by existing speech synthesis methods typically offers only a limited number of different timbres. Synthesizing speech with the timbre of a specified person remains a direction that requires further study.
Disclosure of Invention
The embodiment of the disclosure provides a speech synthesis method and device.
In a first aspect, an embodiment of the present disclosure provides a speech synthesis method, including: acquiring a text; determining phonetic features of the text; and inputting the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is trained based on singing data of the target user.
In a second aspect, an embodiment of the present disclosure provides a speech synthesis apparatus, including: an acquisition unit configured to acquire a text; a determination unit configured to determine a phonetic feature of the text; and the synthesis unit is configured to input the phonetic features into a pre-trained speech synthesis model corresponding to the target user to obtain speech data of the target user for the text, wherein the speech synthesis model is obtained by training based on the singing data of the target user.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; storage means for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which computer program, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the speech synthesis method and apparatus provided by the embodiments of the disclosure, the phonetic features of any text are processed by a speech synthesis model that corresponds to the target user and is pre-trained on the singing data of the target user. In this way, speaking data simulating the target user reading the text can be synthesized, and singing data simulating the target user singing the text can also be synthesized. Therefore, for users from whom it is difficult to collect a large amount of speaking data, speaking data simulating the user speaking any text can be conveniently generated from the user's singing data.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a method of speech synthesis according to the present disclosure;
FIG. 3 is a flow diagram of yet another embodiment of a speech synthesis method according to the present disclosure;
FIG. 4 is a schematic diagram of an application scenario of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of a speech synthesis apparatus according to the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 shows an exemplary architecture 100 to which embodiments of the speech synthesis method or speech synthesis apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. Various client applications may be installed on the terminal devices 101, 102, 103. Such as browser-type applications, search-type applications, speech processing-type applications, social platform software, information flow applications, and so forth.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server that provides various services, for example, a server that processes text sent by the terminal apparatuses 101, 102, 103 to obtain voice data corresponding to the text. Further, the server may feed back the synthesized voice data to the terminal devices 101, 102, 103.
Note that the text may be directly stored locally in the server 105, and the server 105 may directly extract and process the locally stored text, in which case the terminal apparatuses 101, 102, and 103 and the network 104 may not be present.
It should be noted that the speech synthesis method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the speech synthesis apparatus is generally disposed in the server 105.
It should be noted that the terminal devices 101, 102, and 103 may also have a function of speech processing (for example, a speech synthesis tool or an application is installed). At this time, the terminal apparatuses 101, 102, and 103 may perform speech synthesis on the text to generate speech data corresponding to the text. In this case, the speech synthesis method may be executed by the terminal apparatuses 101, 102, and 103, and accordingly, the speech synthesis apparatus may be provided in the terminal apparatuses 101, 102, and 103. At this point, the exemplary system architecture 100 may not have the server 105 and the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a speech synthesis method according to the present disclosure is shown. The speech synthesis method comprises the following steps:
step 201, acquiring a text.
In the present embodiment, the execution subject of the speech synthesis method (e.g., the server 105 shown in fig. 1) may acquire the text locally, from another storage device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), or from a connected database. The text may be any of a variety of texts. For example, the text may be a piece of lyrics or a speech. For another example, the lyrics may be a blessing message, and so on.
Step 202, determining phonetic features of the text.
In this embodiment, a phonetic feature may refer to a pronunciation-related feature of the speech corresponding to the text. In other words, the phonetic features may characterize the pronunciation mechanism, pronunciation characteristics, and the like of each sound in the speech corresponding to the text. For example, the phonetic features may include intonation features characterizing variation in pitch, duration features characterizing the duration of sound vibration, intensity features characterizing the amplitude of sound vibration, and so forth.
Optionally, the phonetic features may include at least one of: phoneme features, tone features, pitch features.
A phoneme is the smallest phonetic unit making up a syllable, divided according to the natural attributes of speech. Phonemes can be analyzed according to the pronunciation actions within a syllable, with one action constituting one phoneme. For example, the syllable "a" has one phoneme, while the syllable "ai" has two phonemes. The phoneme features may be used to characterize the phoneme composition of the speech corresponding to the text.
Tone may refer to the variation in the pitch of a sound. As an example, for Mandarin Chinese, the tones may include the four tones yin ping (first tone), yang ping (second tone), shang sheng (third tone), and qu sheng (fourth tone). Tone features can be used to characterize the tone of each word in the speech corresponding to the text.
For example, pitch may be represented based on "Do, Re, Mi, Fa, So, La, Si". Pitch features may be used to characterize the pitch of each tone in the speech to which the text corresponds.
Alternatively, the pitch features may be represented using a MIDI file corresponding to the speech of the text. MIDI (Musical Instrument Digital Interface) is a protocol, or technology, established by manufacturers of electronic musical instruments. A MIDI file can describe music information in bytes. For example, the MIDI file corresponding to a piece of music may record the pitch information of each note in that piece, as in the sketch below.
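As an illustration of how pitch information might be read from such a MIDI file, the following is a minimal sketch assuming the open-source "mido" Python library; the function name and the (pitch, time) output layout are illustrative choices rather than part of the disclosure.

```python
import mido

def extract_pitch_sequence(midi_path):
    """Return a list of (pitch, start_time_seconds) pairs for the sounded notes."""
    midi = mido.MidiFile(midi_path)
    pitches = []
    current_time = 0.0
    for msg in midi:  # iterating a MidiFile yields messages in playback order, with time deltas in seconds
        current_time += msg.time
        if msg.type == "note_on" and msg.velocity > 0:
            pitches.append((msg.note, current_time))  # msg.note is the MIDI pitch number (60 = middle C, "Do")
    return pitches
```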
Optionally, the phonetic features may also include prosodic features. The prosodic features can be used to characterize the position, within words and/or sentences, of each phoneme in the speech corresponding to the text, where a position may be an interior or a boundary. As an example, in the phrase "lovely people", the phoneme "i" in "love(ly)" lies at a boundary of the word "lovely" while lying inside the sentence "lovely people".
It should be noted that the various phonetic features exemplified above are merely examples, and a skilled person can flexibly adopt various phonetic features according to actual application requirements and application scenarios. Meanwhile, the above example for each phonetic feature is also only an example, and each phonetic feature may also select different representation modes according to the actual application requirements and application scenarios.
In this embodiment, the phonetic features of the text can be obtained using various existing speech processing methods. For example, when the phonetic features include phoneme features, various existing phoneme representation methods may be used to determine the phoneme representation corresponding to each word in the text, and the phoneme features of the text can then be determined from the obtained phoneme representations.
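As an illustration only, the following is a minimal sketch of deriving coarse phoneme and tone features for a Chinese text, assuming the open-source "pypinyin" library; the syllable-level phoneme granularity and the tone encoding are illustrative assumptions rather than the representation prescribed by the disclosure.

```python
from pypinyin import pinyin, Style

def phonetic_features(text):
    """Return (phonemes, tones): one pinyin syllable per character and its tone number (0 = neutral)."""
    syllables = [item[0] for item in pinyin(text, style=Style.TONE3)]  # e.g. "sheng1"
    phonemes, tones = [], []
    for syllable in syllables:
        if syllable and syllable[-1].isdigit():
            phonemes.append(syllable[:-1])   # coarse phoneme feature, e.g. "sheng"
            tones.append(int(syllable[-1]))  # tone feature, e.g. 1 (yin ping)
        else:
            phonemes.append(syllable)
            tones.append(0)                  # neutral tone or non-Chinese token
    return phonemes, tones

# Example: phonetic_features("生日快乐") -> (["sheng", "ri", "kuai", "le"], [1, 4, 4, 4])
```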
Step 203, inputting the phonetic features into a pre-trained speech synthesis model corresponding to the target user to obtain speech data of the target user for the text.
In this embodiment, the speech synthesis model may be used to generate speech data corresponding to the text according to the phonetic features of the text. The speech synthesis model corresponding to the target user may be used to generate the speech data of the target user for the text according to the phonetic features of the text, that is, speech data of the target user reading or singing the text. The target user may be any user specified in advance. The speech data may be used to characterize speech, including speech corresponding to singing, speech corresponding to speaking, and the like.
In this embodiment, the speech synthesis model corresponding to the target user may be obtained by training based on the singing data of the target user. The singing data of the target user can be voice data obtained by the target user through singing a song.
Optionally, the singing data may be unaccompanied (vocal-only) singing data. In this case, when a song sung by the target user is acquired, the vocal-only singing data corresponding to the song may be obtained through various existing vocal-accompaniment separation techniques (i.e., techniques for separating the vocals from the accompaniment), as in the sketch below.
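For instance, one widely used open-source separation tool (not named in the patent) can be applied as follows; this is only a sketch, and the file paths are placeholders.

```python
# A sketch of vocal-accompaniment separation, assuming the open-source
# "spleeter" library; the patent does not prescribe a specific tool.
from spleeter.separator import Separator

separator = Separator('spleeter:2stems')  # pretrained vocals/accompaniment model
# Writes output/<song name>/vocals.wav and accompaniment.wav; paths are placeholders.
separator.separate_to_file('target_user_song.wav', 'output')
```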
Alternatively, the speech synthesis model may be trained as follows. First, a training sample set is acquired, wherein each training sample in the training sample set comprises singing data of the target user and phonetic features of the singing data. Next, an initial speech synthesis model is acquired; the initial speech synthesis model may be one of various existing open-source speech synthesis models, or an initial speech synthesis model constructed by technicians according to actual application requirements (for example, built using frameworks such as Keras or Caffe). The initial speech synthesis model is then trained using the training sample set. Specifically, by a machine learning method, the phonetic features in the training samples may be used as the input of the initial speech synthesis model and the speech data corresponding to the input phonetic features as its expected output; the trained initial speech synthesis model is then determined as the speech synthesis model.
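As a rough illustration of what such a training sample set might look like in code, the following sketch pairs each singing clip with the phonetic features of its lyric sentence; the sentence-level granularity, the file layout, and the featurizer callable are all illustrative assumptions.

```python
import os
import librosa  # used only to load the singing clips

def build_training_samples(clip_dir, lyrics, featurizer):
    """clip_dir: directory of singing clips of the target user.
    lyrics: dict mapping clip filename -> lyric sentence sung in that clip.
    featurizer: callable mapping a sentence to its phonetic features."""
    samples = []
    for filename, sentence in lyrics.items():
        waveform, sample_rate = librosa.load(os.path.join(clip_dir, filename), sr=None)
        samples.append({
            "phonetic": featurizer(sentence),  # phonetic features of the singing data (model input)
            "audio": waveform,                 # the singing data itself (expected output)
            "sample_rate": sample_rate,
        })
    return samples
```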
In some optional implementations of this embodiment, the speech synthesis model may include an acoustic feature prediction model and a vocoder. The acoustic feature prediction model can be used to predict the acoustic features corresponding to the input phonetic features. The vocoder may be used to generate speech data corresponding to the input acoustic features. The output of the acoustic feature prediction model can be used as the input of the vocoder.
Where acoustic features may refer to features of aspects of the speech signal. In other words, the acoustic features may characterize characteristics of a speech signal of speech to which the text corresponds, and the like. The skilled person can flexibly select the representation mode of the acoustic features according to the actual application requirements and application scenarios.
Optionally, the acoustic features may include at least one of MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), and PLP (Perceptual Linear Prediction) coefficients.
At this time, after determining the phonetic features of the text, the phonetic features of the text may be input to the acoustic feature prediction model to obtain acoustic features corresponding to the input phonetic features, and then the obtained acoustic features may be input to the vocoder to obtain voice data corresponding to the input acoustic features.
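The two-stage flow just described can be sketched as the following hedged pseudo-pipeline; the two models are treated as opaque trained callables, since the disclosure does not fix their concrete architectures.

```python
import numpy as np

def synthesize(phonetic_features, acoustic_model, vocoder):
    """Map phonetic features -> acoustic features -> waveform for the target user."""
    acoustic_features = acoustic_model(phonetic_features)  # e.g. a sequence of MFCC frames
    waveform = vocoder(acoustic_features)                  # speech data of the target user for the text
    return np.asarray(waveform)
```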
The training data of the acoustic feature prediction model may include phonetic features and acoustic features of the singing data of the target user, and the training data of the vocoder may include the singing data of the target user and corresponding acoustic features.
As an example, the acoustic feature prediction model may be trained by: obtaining a first sample set, wherein each sample in the first sample set may include phonetic features and acoustic features of a piece of speech data, and the speech data may include singing data of the target user; training a first initial model using the first sample set; and determining the trained first initial model as the acoustic feature prediction model.
The granularity of the voice data corresponding to each sample in the first sample set can be flexibly set according to actual application requirements. For example, the granularity of the speech data corresponding to each sample may be speech data corresponding to one sentence. For another example, the granularity of the speech data corresponding to each sample may be speech data corresponding to a paragraph.
For example, the target user may be a singer. At this time, the singing data of the target user can be obtained from the song that the target user sings. Taking the granularity of the speech data corresponding to each sample as one sentence as an example, the speech data corresponding to each sentence in the song sung by the target user may correspond to one sample.
In this embodiment, for a piece of voice data, the acoustic features and the phonetic features of the voice data can be obtained using various existing voice processing methods; a sketch of acoustic feature extraction is given below.
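As a concrete but non-binding example, MFCC acoustic features can be extracted with the open-source "librosa" library as sketched here; the number of coefficients and the use of the native sample rate are illustrative defaults.

```python
import librosa

def acoustic_features(audio_path, n_mfcc=13):
    """Return an (n_frames, n_mfcc) matrix of MFCCs for the given voice data file."""
    waveform, sample_rate = librosa.load(audio_path, sr=None)  # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T                                              # one row per frame
```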
For example, the phonetic features of a piece of speech data can be obtained by acquiring the text corresponding to the speech data and deriving the phonetic features of the speech data from that text. When the speech data is singing data, the text corresponding to the speech data may be the lyrics of the song indicated by the singing data. The phonetic features of the speech data corresponding to the text can be obtained using various existing text processing and speech processing techniques.
It should be noted that, the above-mentioned extracting the phonetic features and the acoustic features of the voice data according to the voice data or the text corresponding to the voice data is a technology widely researched and applied at present, and is not described herein again.
The first initial model may be any of various types of untrained or trained artificial neural networks. For example, the first initial model may be one of various convolutional neural networks. The first initial model may also be a model that combines a plurality of untrained or trained artificial neural networks. For example, the first initial model may be obtained by combining some existing models for extracting acoustic features.
In training the first initial model using the first sample set, the training of the first initial model may be completed using various existing model training methods in machine learning.
For example, the training step of obtaining the acoustic feature prediction model by using the first sample set may specifically include:
step one, selecting a sample from a first sample set.
In this step, the method of selecting samples from the first sample set may be different according to different application requirements. For example, a predetermined number of samples may be randomly selected from the first sample set, or may be selected in a designated order. Wherein the preset number may be preset by a technician.
And step two, inputting the phonetic features in the selected sample into the first initial model to obtain corresponding predicted acoustic features.
In this step, it should be understood that, if the number of the selected samples is more than two, the phonetic features in each selected sample may be respectively input into the first initial model, so as to respectively obtain the predicted acoustic features corresponding to each input phonetic feature.
And step three, determining the value of the loss function according to the obtained predicted acoustic characteristics and the acoustic characteristics in the selected sample.
In this step, the value of the loss function may be determined based on the comparison of the obtained predicted acoustic features with the acoustic features in the selected sample. Wherein the loss function may be preset by a technician.
And step four, responding to the fact that the training of the first initial model is determined to be finished according to the value of the loss function, and determining the trained first initial model as the acoustic feature prediction model.
In this step, it may be determined whether the first initial model is trained according to the value of the loss function. The method for determining whether the first initial model is trained or not can be flexibly set by a technician according to an actual application scenario. For example, whether the first initial model is trained can be determined by judging the magnitude relation between the value of the loss function and a preset loss threshold.
And step five, responding to the fact that the first initial model is determined to be not trained completely according to the value of the loss function, adjusting parameters of the first initial model, reselecting samples from the first sample set, using the adjusted first initial model as the first initial model, and continuing to execute the training step.
In this step, parameters of the network layers of the first initial model may be adjusted according to the values of the loss function based on algorithms such as gradient descent and back propagation.
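Steps one to five above can be sketched as the following PyTorch training loop; the feed-forward architecture, batch size, mean-squared-error loss, and loss threshold are illustrative assumptions rather than the configuration claimed by the patent.

```python
import random
import torch
import torch.nn as nn

def train_acoustic_model(first_sample_set, in_dim, out_dim,
                         batch_size=8, loss_threshold=0.01, max_steps=10000):
    """first_sample_set: list of dicts with frame-level 'phonetic' and 'acoustic' tensors."""
    model = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(max_steps):
        batch = random.sample(first_sample_set, batch_size)     # step one: select samples
        phonetic = torch.stack([s["phonetic"] for s in batch])  # phonetic features
        acoustic = torch.stack([s["acoustic"] for s in batch])  # target acoustic features (e.g. MFCC)
        predicted = model(phonetic)                             # step two: predicted acoustic features
        loss = loss_fn(predicted, acoustic)                     # step three: value of the loss function
        if loss.item() < loss_threshold:                        # step four: training considered complete
            break
        optimizer.zero_grad()                                   # step five: adjust parameters and continue
        loss.backward()
        optimizer.step()
    return model  # used as the acoustic feature prediction model
```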
As an example, the vocoder may be trained by: obtaining a second sample set, wherein each sample in the second sample set can comprise singing data of a target user and corresponding acoustic features; training a second initial model using the second sample set, and determining the trained second initial model as a vocoder.
The granularity of the singing data corresponding to each sample in the second sample set can be flexibly set according to actual application requirements. For example, the granularity of the singing data corresponding to each sample may be the singing data corresponding to one sentence. For another example, the granularity of the singing data corresponding to each sample may be the singing data corresponding to one paragraph.
The second initial model may be any of various types of untrained or trained artificial neural networks. For example, the second initial model may be one of various deep learning networks. The second initial model may also be a model that combines a plurality of untrained or trained artificial neural networks. For example, the second initial model may be obtained by combining various existing vocoders.
In training the second initial model using the second sample set, the training of the second initial model may be completed using various existing model training methods in machine learning.
For example, the training step of obtaining the vocoder by using the second sample set may specifically include:
step one, selecting samples from a second sample set.
In this step, the method of selecting samples from the second sample set may be different according to different application requirements. For example, a predetermined number of samples may be randomly selected from the second sample set, or may be selected in a designated order. Wherein the preset number may be preset by a technician.
And step two, inputting the acoustic characteristics in the selected sample into a second initial model to obtain corresponding output voice data.
In this step, it should be understood that, if the number of the selected samples is more than two, the acoustic features in each selected sample may be respectively input to the second initial model, and the output voice data corresponding to each input acoustic feature is respectively obtained.
And step three, determining the value of the loss function according to the obtained output voice data and the voice data in the selected sample.
In this step, the value of the loss function may be determined based on the comparison of the obtained output speech data with the speech data in the selected sample. Wherein the loss function may be preset by a technician.
And step four, in response to determining that the training of the second initial model is completed according to the value of the loss function, determining the trained second initial model as a vocoder.
In this step, it may be determined whether the second initial model is trained according to the value of the loss function. The method for determining whether the second initial model is trained or not can be flexibly set by a technician according to an actual application scenario. For example, it may be determined whether the second initial model is trained by determining a magnitude relationship between a value of the loss function and a preset loss threshold.
And step five, responding to the fact that the second initial model is determined to be not trained completely according to the value of the loss function, adjusting parameters of the second initial model, reselecting samples from the second sample set, using the adjusted second initial model as the second initial model, and continuing to execute the training step.
In this step, the parameters of the network layers of the second initial model may be adjusted according to the values of the loss function based on algorithms such as gradient descent and back propagation.
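The vocoder training steps follow the same pattern; a minimal PyTorch sketch is given below, where the second initial model simply maps each acoustic-feature frame to a block of waveform samples. The hop size, architecture, and L1 loss are illustrative assumptions; practical vocoders are considerably more elaborate.

```python
import random
import torch
import torch.nn as nn

def train_vocoder(second_sample_set, acoustic_dim, hop_size=256,
                  batch_size=8, loss_threshold=0.01, max_steps=10000):
    """second_sample_set: list of dicts with an 'acoustic' frame and the matching 'waveform_block'."""
    model = nn.Sequential(nn.Linear(acoustic_dim, 512), nn.Tanh(), nn.Linear(512, hop_size))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.L1Loss()
    for _ in range(max_steps):
        batch = random.sample(second_sample_set, batch_size)        # step one: select samples
        acoustic = torch.stack([s["acoustic"] for s in batch])      # acoustic features of singing data
        target = torch.stack([s["waveform_block"] for s in batch])  # hop_size waveform samples per frame
        output = model(acoustic)                                    # step two: output voice data
        loss = loss_fn(output, target)                              # step three: value of the loss function
        if loss.item() < loss_threshold:                            # step four: training considered complete
            break
        optimizer.zero_grad()                                       # step five: adjust parameters and continue
        loss.backward()
        optimizer.step()
    return model  # used as the vocoder
```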
It should be noted that the speech data corresponding to each sample in the first sample set and the speech data corresponding to each sample in the second sample set may have an intersection, or may not have an intersection. In addition, the first sample set and the second sample set may be obtained in the same manner, or may be obtained in different manners.
It should be noted that the execution subjects of the acoustic feature prediction model and the training process of the vocoder may be the same or different. In addition, the execution subjects of the acoustic feature prediction model and the training process of the vocoder may be the same as or different from those of the speech synthesis method.
Since the vocoder is trained using the voice data (including singing data) of the target user and the corresponding acoustic features, the trained vocoder can be used to obtain voice data (such as singing data) similar to that of the target user; that is, an effect similar to the target user's own singing can be produced without requiring the target user to sing in person.
In some optional implementations of this embodiment, the speech data corresponding to the samples in the first sample set may further include singing data of the users in the sample user set. Wherein the sample user set may be composed of some users that are specified in advance. The sample set of users may not include the target user. The singing data of each user in the sample set of users may also be speech data obtained by the user by singing a song.
By adding the singing data or speaking data of users other than the target user to the first sample set, the distribution of the acoustic features (such as MFCCs) corresponding to those other users' singing data can also be referenced when the acoustic feature prediction model is trained, so that the acoustic features of the target user's singing data can be predicted more accurately, improving the accuracy and stability of the output of the acoustic feature prediction model.
It should be understood that any voice data used in this disclosure should be handled in accordance with the copyright permissions attached to that voice data.
The speech synthesis method provided by the above embodiment of the present disclosure uses the singing data of the target user to pre-train a speech synthesis model that synthesizes speech data of the target user from phonetic features. As a result, for any text, the speech synthesis model corresponding to the target user can synthesize speech data (such as singing data) similar to the target user's from the phonetic features of the text. In this way, speech data resembling the target user can be conveniently obtained without requiring the target user to record in person.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a speech synthesis method is shown. The process 300 of the speech synthesis method includes the following steps:
step 301, a text is obtained.
Step 302, determining phonetic features of the text.
The specific implementation process of steps 301 and 302 may refer to the related description of steps 201 and 202 in the corresponding embodiment of fig. 2, and will not be described herein again.
Step 303, inputting the phonetic features into an acoustic feature prediction model included in a pre-trained speech synthesis model corresponding to the target user, so as to obtain acoustic features corresponding to the input phonetic features.
In this embodiment, the training data of the acoustic feature prediction model may include phonetic features and acoustic features of singing data of the target user, and may further include speech data of users in the sample user set.
Wherein the sample user set may be composed of some users that are specified in advance. The sample user set may not include the target user. The speech data for each user in the sample set of users may be speech data generated when the user speaks.
When training the acoustic feature prediction model corresponding to the target user, the phonetic features and acoustic features of the target user's singing data are used together with the phonetic features and acoustic features of the speaking data of users other than the target user. In this way, the feature distribution of the acoustic features (such as MFCCs) corresponding to other users' speaking data can be referenced during training to guide the generation of the acoustic features of the target user's speaking data. Therefore, the trained speech synthesis model can be used to generate not only the singing data of the target user but also the speaking data of the target user.
Optionally, in addition to the phonetic features and acoustic features of the singing data of the target user and of the speaking data of the users in the sample user set, the training data of the acoustic feature prediction model may further include the singing data of the users in the sample user set.
In this way, the trained speech synthesis model can be used to generate both the singing data and the speaking data of the target user, while the accuracy and stability of the singing data or speaking data generated by the speech synthesis model can be ensured.
Alternatively, when representing the phonetic features corresponding to the singing data and the speaking data, the same phonetic feature representation may be adopted for both, with different distinguishing symbols used to tell them apart; this can further facilitate the training of the acoustic feature prediction model, as in the sketch below.
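As a small illustration of this idea, the sketch below attaches a distinguishing symbol (a domain tag) to an otherwise shared phonetic representation; the tag values and the dictionary layout are illustrative assumptions.

```python
SINGING_TAG, SPEAKING_TAG = "<sing>", "<speak>"

def tag_phonetic_features(phonemes, tones, is_singing):
    """Reuse the same phonetic representation but mark whether it comes from singing or speaking data."""
    return {
        "domain": SINGING_TAG if is_singing else SPEAKING_TAG,
        "phonemes": phonemes,
        "tones": tones,
    }

# Example: tag_phonetic_features(["sheng", "ri"], [1, 4], is_singing=True)
```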
The specific execution process of the content other than the speech data of the users in the sample user set in step 303 may refer to the related description of step 203 in the corresponding embodiment of fig. 2, and is not repeated herein.
Step 304, inputting the obtained acoustic features to a pre-trained vocoder included in the speech synthesis model corresponding to the target user, so as to obtain the speech data of the target user for the text.
The specific execution process of step 304 may refer to the related description of step 203 in the corresponding embodiment of fig. 2, and is not repeated herein.
With continued reference to fig. 4, fig. 4 is a schematic diagram 400 of an application scenario of the speech synthesis method according to the present embodiment. In the application scenario of fig. 4, the acoustic feature prediction model 402 corresponding to the singer "X" may be obtained by pre-training with the phonetic features and acoustic features determined from the singing data of a large number of songs by singer "X", together with the phonetic features and acoustic features of other people's speaking data. At the same time, the vocoder 403 corresponding to singer "X" may be trained with the singing data of a large number of songs by singer "X" and the corresponding acoustic features.
A text 401 that singer "X" is expected to read aloud is obtained. As shown by reference numeral 401 in the figure, the content of the text 401 may be the festival blessing phrase "happy birthday to everybody". A phoneme feature, a tone feature, and a pitch feature of the text 401 may then be determined as the phonetic features of the text 401. The determined phonetic features of the text 401 may then be input to the pre-trained acoustic feature prediction model 402, resulting in MFCCs for the text 401. The resulting MFCCs may then be input to the pre-trained vocoder 403 to obtain the speech data of singer "X" reading the text 401.
The speech synthesis method provided by the above embodiment of the present disclosure trains the acoustic feature prediction model corresponding to the target user using the singing data of the target user together with the speaking data and/or singing data of some users other than the target user, and trains the vocoder using the singing data of the target user. In this way, when the acoustic feature prediction model corresponding to the target user is trained, the feature distribution of the acoustic features corresponding to other users' speaking data guides the distribution of the acoustic features corresponding to the target user's speaking data. As a result, the speech synthesis model composed of the trained acoustic feature prediction model and the vocoder can produce not only singing data simulating the target user but also speaking data simulating the target user. In addition, the method can obtain speaking data similar to the target user's without using a large amount of the target user's voice data as training data, that is, without large cost consumption.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a speech synthesis apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 5, the speech synthesis apparatus 500 provided by the present embodiment includes an acquisition unit 501, a determination unit 502, and a synthesis unit 503. Wherein the obtaining unit 501 is configured to obtain a text; the determining unit 502 is configured to determine phonetic features of the text; the synthesis unit 503 is configured to input the phonetic features to a pre-trained speech synthesis model corresponding to the target user, which is trained based on the singing data of the target user, to obtain speech data of the target user for the text.
In the present embodiment, the speech synthesis apparatus 500: the specific processing of the obtaining unit 501, the determining unit 502, and the synthesizing unit 503 and the technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, which are not repeated herein.
In the apparatus provided by the above embodiment of the present disclosure, the text is acquired by the acquisition unit; the determining unit determines phonetic features of the text; the synthesis unit inputs the phonetic features into a pre-trained speech synthesis model corresponding to the target user to obtain speech data of the target user for the text, wherein the speech synthesis model is obtained by training based on singing data of the target user. Therefore, the speech synthesis model pre-trained based on the singing data of the target user can be used for synthesizing the speaking data simulating any text spoken by the target user, and the singing data simulating any text sung by the target user can also be synthesized.
Referring now to FIG. 6, a schematic diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for use in implementing embodiments of the present disclosure is shown. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 602 or a program loaded from a storage means 608 into a random access memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
In general, the electronic device 600 may include input devices 606 such as a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 such as a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 such as magnetic tape, hard disk, etc.; and communication devices 609.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, there is provided a speech synthesis method including: acquiring a text; determining phonetic features of the text; and inputting the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is trained based on singing data of the target user.
According to one or more embodiments of the present disclosure, the speech synthesis model includes an acoustic feature prediction model and a vocoder; and inputting the phonetic features into a pre-trained speech synthesis model corresponding to the target user to obtain speech data of the target user for the text comprises: inputting the phonetic features of the text into the acoustic feature prediction model to obtain acoustic features corresponding to the input phonetic features; and inputting the obtained acoustic features into the vocoder to obtain speech data corresponding to the input acoustic features.
According to one or more embodiments of the present disclosure, the training data of the acoustic feature prediction model includes phonetic features and acoustic features of the singing data of the target user, and the training data of the vocoder includes the singing data of the target user and corresponding acoustic features.
According to one or more embodiments of the present disclosure, the training data of the acoustic feature prediction model further includes phonetic features and acoustic features of the speech data and/or singing data of the users in the sample user set, the sample user set not including the target user.
According to one or more embodiments of the present disclosure, the above-mentioned phonetic features include at least one of: phoneme characteristics, tone characteristics, pitch characteristics.
According to one or more embodiments of the present disclosure, the pitch characteristics are represented using a MIDI file corresponding to the speech corresponding to the text.
According to one or more embodiments of the present disclosure, the phonetic features further include prosodic features, where the prosodic features are used to characterize positions of words and/or sentences in which each phoneme in the speech corresponding to the text is located.
According to one or more embodiments of the present disclosure, there is provided a speech synthesis apparatus including: an acquisition unit configured to acquire a text; a determination unit configured to determine a phonetic feature of the text; and the synthesis unit is configured to input the phonetic features into a pre-trained speech synthesis model corresponding to the target user to obtain speech data of the target user for the text, wherein the speech synthesis model is obtained by training based on the singing data of the target user.
According to one or more embodiments of the present disclosure, the speech synthesis model includes an acoustic feature prediction model and a vocoder; the synthesis unit is further configured to input the phonetic features of the text into the acoustic feature prediction model to obtain acoustic features corresponding to the input phonetic features, and to input the obtained acoustic features into the vocoder to obtain speech data corresponding to the input acoustic features.
According to one or more embodiments of the present disclosure, the training data of the acoustic feature prediction model includes phonetic features and acoustic features of the singing data of the target user, and the training data of the vocoder includes the singing data of the target user and corresponding acoustic features.
According to one or more embodiments of the present disclosure, the training data of the acoustic feature prediction model further includes phonetic features and acoustic features of the speech data and/or singing data of the users in the sample user set, the sample user set not including the target user.
According to one or more embodiments of the present disclosure, the above-mentioned phonetic features include at least one of: phoneme characteristics, tone characteristics, pitch characteristics.
According to one or more embodiments of the present disclosure, the above-described pitch characteristics are represented using MIDI files corresponding to voice data.
According to one or more embodiments of the present disclosure, the phonetic features further include prosodic features, where the prosodic features are used to characterize positions of words and/or sentences in which the phonemes corresponding to the speech data are located.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor including an acquisition unit, a determination unit, and a synthesis unit. The names of these units do not in some cases constitute a limitation on the unit itself; for example, the acquisition unit may also be described as a "unit for acquiring a text".
As another aspect, the present disclosure also provides a computer-readable medium. The computer readable medium may be included in the electronic device described above, or it may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a text; determine phonetic features of the text; and input the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is trained based on singing data of the target user.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above-mentioned features, and also covers other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A method of speech synthesis comprising:
acquiring a text;
determining phonetic features of the text;
and inputting the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is trained based on singing data of the target user.
2. The method of claim 1, wherein the speech synthesis model comprises an acoustic feature prediction model and a vocoder; and
the inputting the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text includes:
inputting the phonetic features of the text into the acoustic feature prediction model to obtain acoustic features corresponding to the input phonetic features;
and inputting the obtained acoustic features into the vocoder to obtain voice data corresponding to the input acoustic features.
3. The method of claim 2, wherein the training data of the acoustic feature prediction model includes phonetic features and acoustic features of the singing data of the target user, and the training data of the vocoder includes the singing data of the target user and corresponding acoustic features.
4. The method of claim 2, wherein the training data of the acoustic feature prediction model further comprises phonetic and acoustic features of the speech data and/or singing data of users in a sample set of users, the sample set of users not including the target user.
5. The method according to one of claims 1-4, wherein the phonetic features comprise at least one of: phoneme characteristics, tone characteristics, pitch characteristics.
6. The method of claim 5, wherein the pitch characteristics are represented using a MIDI file corresponding to the speech corresponding to the text.
7. The method of claim 5, wherein the phonetic features further comprise prosodic features, wherein the prosodic features are used for characterizing positions of words and/or sentences in which each phoneme in the speech corresponding to the text is located.
8. A speech synthesis apparatus, wherein the apparatus comprises:
an acquisition unit configured to acquire a text;
a determination unit configured to determine a phonetic feature of the text;
and the synthesis unit is configured to input the phonetic features into a pre-trained speech synthesis model corresponding to a target user to obtain speech data of the target user for the text, wherein the speech synthesis model is obtained by training based on singing data of the target user.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202010253084.5A 2020-04-02 2020-04-02 Speech synthesis method and device Pending CN111477210A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010253084.5A CN111477210A (en) 2020-04-02 2020-04-02 Speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010253084.5A CN111477210A (en) 2020-04-02 2020-04-02 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN111477210A true CN111477210A (en) 2020-07-31

Family

ID=71749534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010253084.5A Pending CN111477210A (en) 2020-04-02 2020-04-02 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN111477210A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
WO2017067246A1 (en) * 2015-10-19 2017-04-27 百度在线网络技术(北京)有限公司 Acoustic model generation method and device, and speech synthesis method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN108831437A (en) * 2018-06-15 2018-11-16 百度在线网络技术(北京)有限公司 A kind of song generation method, device, terminal and storage medium
CN109285537A (en) * 2018-11-23 2019-01-29 北京羽扇智信息科技有限公司 Acoustic model foundation, phoneme synthesizing method, device, equipment and storage medium
CN109801608A (en) * 2018-12-18 2019-05-24 武汉西山艺创文化有限公司 A kind of song generation method neural network based and system
CN110136689A (en) * 2019-04-02 2019-08-16 平安科技(深圳)有限公司 Song synthetic method, device and storage medium based on transfer learning
CN110310621A (en) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Sing synthetic method, device, equipment and computer readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112331222A (en) * 2020-09-23 2021-02-05 北京捷通华声科技股份有限公司 Method, system, equipment and storage medium for converting song tone
CN112382267A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and storage medium for converting accents
CN112614481A (en) * 2020-12-08 2021-04-06 浙江合众新能源汽车有限公司 Voice tone customization method and system for automobile prompt tone
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN111899720B (en) Method, apparatus, device and medium for generating audio
Gold et al. Speech and audio signal processing: processing and perception of speech and music
US8825486B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
KR101274961B1 (en) music contents production system using client device.
CN111445892B (en) Song generation method and device, readable medium and electronic equipment
CN111477210A (en) Speech synthesis method and device
KR20210082153A (en) Method and system for generating synthesis voice for text via user interface
CN111402843B (en) Rap music generation method and device, readable medium and electronic equipment
US8914291B2 (en) Method and apparatus for generating synthetic speech with contrastive stress
CN111782576B (en) Background music generation method and device, readable medium and electronic equipment
US20140236597A1 (en) System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN112382270A (en) Speech synthesis method, apparatus, device and storage medium
CN112927674A (en) Voice style migration method and device, readable medium and electronic equipment
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
KR102277205B1 (en) Apparatus for converting audio and method thereof
CN112382274A (en) Audio synthesis method, device, equipment and storage medium
CN112071287A (en) Method, apparatus, electronic device and computer readable medium for generating song score
Trisyanto et al. Emotion Recognition Based on Voice Using Combination of Long Short Term Memory (LSTM) and Recurrent Neural Network (RNN) for Automation Music Healing Application
Oralbekova et al. Current advances and algorithmic solutions in speech generation
Сатыбалдиыева et al. Analysis of methods and models for automatic processing systems of speech synthesis
Ning et al. Localized Mandarin Speech Synthesis Services for Enterprise Scenarios
Li et al. Mandarin stress analysis and Prediction for speech synthesis
Ranasinghe et al. Analysis on Using Synthesized Singing Techniques in Assistive Interfaces for Visually Impaired to Study Music
CN114220410A (en) Audio processing method, device, computing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination