CN111862933A - Method, apparatus, device and medium for generating synthesized speech - Google Patents

Method, apparatus, device and medium for generating synthesized speech

Info

Publication number
CN111862933A
Authority
CN
China
Prior art keywords
voice
speech
voiceprint
voiceprint features
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010698372.1A
Other languages
Chinese (zh)
Inventor
殷翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010698372.1A priority Critical patent/CN111862933A/en
Publication of CN111862933A publication Critical patent/CN111862933A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/04 Training, enrolment or model building
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Abstract

Embodiments of the present disclosure disclose methods, apparatuses, devices and media for generating synthesized speech. One embodiment of the method comprises: synthesizing a second number of voiceprint features using an acquired first number of voiceprint features, wherein the second number is greater than the first number; and then generating a set of synthesized speech based on the second number of voiceprint features and an obtained third number of texts. In this way, the embodiment augments, from the existing voiceprint features, synthesized speech covering more voiceprint features, thereby providing a data basis for augmenting training samples for low-resource corpora and helping to improve the accuracy of a speech synthesis model.

Description

Method, apparatus, device and medium for generating synthesized speech
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method and apparatus for generating synthesized speech.
Background
With the development of artificial intelligence technology, various machine learning models are applied more and more. For machine learning models, the amount of training data is often the primary factor determining the performance of the model.
For example, in the field of speech synthesis, a speech synthesis model trained in the prior art for a language with little training data tends to be unstable and of low accuracy, which is usually caused by an insufficient number of training samples.
Disclosure of Invention
The present disclosure presents methods, apparatuses, devices and media for generating synthesized speech.
In a first aspect, an embodiment of the present disclosure provides a method for generating synthesized speech, the method including: obtaining a first number of voiceprint features; synthesizing a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; acquiring a third number of texts; a set of synthesized speech is generated based on the second number of voiceprint features and the third number of texts.
In some embodiments, synthesizing the second number of voiceprint features using the first number of voiceprint features comprises: selecting voiceprint features from the first number of voiceprint features and performing a proportional fusion operation on them to generate the second number of voiceprint features.
In some embodiments, obtaining the first number of voiceprint features comprises: acquiring a first voice set, wherein the first voice set comprises voices in the same language; and inputting the voices in the first voice set into a pre-trained voiceprint recognition model to obtain voiceprint features corresponding to the input voices.
In some embodiments, the obtaining the first speech set includes: acquiring a second voice set, wherein the number of voices in the second voice set is greater than the number of voices in the first voice set; inputting the voice in the second voice set to a pre-trained voice recognition model to obtain a recognition text corresponding to the input voice; and selecting voice from the second voice set to generate a first voice set according to the obtained recognition rate of the recognized text.
In some embodiments, the selecting a speech from the second speech set according to the recognition rate of the obtained recognized text to generate the first speech set includes: in response to the fact that the recognition rate of the obtained recognition text is larger than a preset threshold value, inputting the voice corresponding to the obtained recognition text into a pre-trained voice quality detection model to obtain a quality score corresponding to the input voice; and selecting voice from the voice corresponding to the recognition text with the recognition rate larger than a preset threshold value to generate a first voice set according to the obtained quality score.
In a second aspect, embodiments of the present disclosure provide a method for training a speech synthesis model, the method comprising: obtaining a training sample set, wherein the training sample set comprises a synthetic speech set generated by the method of any embodiment of the first aspect and a third number of texts corresponding to the synthetic speech set; acquiring an initial speech synthesis model; and taking the text of the training sample in the training sample set as input, taking the synthesized voice corresponding to the input text as expected output, and training to obtain a voice synthesis model.
In some embodiments, the method further comprises: acquiring target voiceprint characteristics; selecting a training sample corresponding to the target voiceprint feature from the training sample set to generate a target training sample set; determining the voice synthesis model as an initial target voiceprint voice synthesis model; and taking the text of the target training sample in the target training sample set as input, taking the synthesized voice corresponding to the input text as expected output, and training to obtain the target voiceprint voice synthesis model.
In a third aspect, an embodiment of the present disclosure provides an apparatus for generating synthesized speech, the apparatus including: a first acquisition unit configured to acquire a first number of voiceprint features; a synthesizing unit configured to synthesize a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; a second acquisition unit configured to acquire a third number of texts; a generating unit configured to generate a set of synthesized speech from the second number of voiceprint features and the third number of texts.
In a fourth aspect, an embodiment of the present disclosure provides an apparatus for training a speech synthesis model, the apparatus including: a third obtaining unit configured to obtain a training sample set, wherein the training sample set includes a synthesized speech set generated by the method according to any one of the embodiments of the first aspect and a third number of texts corresponding to the synthesized speech set; a fourth obtaining unit configured to obtain an initial speech synthesis model; and the training unit is configured to take the text of the training samples in the training sample set as input, take the synthesized voice corresponding to the input text as expected output, and train to obtain the voice synthesis model.
In a fifth aspect, an embodiment of the present disclosure provides a server for generating synthesized speech, including: one or more processors; a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement a method as in any one of the embodiments described above for generating synthesized speech or for training a speech synthesis model.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium for generating synthesized speech, on which a computer program is stored, which when executed by a processor implements the method of any of the embodiments of the method for generating synthesized speech or the method for training a speech synthesis model as described above.
Embodiments of the present disclosure provide a method and apparatus for generating synthesized speech, which synthesize a second number of voiceprint features from an acquired first number of voiceprint features, wherein the second number is greater than the first number, and then generate a set of synthesized speech based on the second number of voiceprint features and an obtained third number of texts. In this way, synthesized speech covering more voiceprint features is augmented from the existing voiceprint features, which provides a data basis for augmenting training samples for low-resource corpora.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for generating synthesized speech according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of a method for generating synthesized speech according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of a method for training a speech synthesis model according to the present disclosure;
FIG. 5 is a schematic diagram illustrating the structure of one embodiment of an apparatus for generating synthesized speech according to the present disclosure;
FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for training a speech synthesis model according to the present disclosure;
FIG. 7 is a schematic block diagram of a computer system suitable for use as a server for implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein merely illustrate the disclosure and are not intended to limit it. It should further be noted that, for convenience of description, only the portions related to the embodiments of the present disclosure are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for generating synthetic speech or an apparatus for generating synthetic speech to which embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to receive or transmit data or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as video playing software, news information applications, audio processing applications, recording software, web browser applications, shopping applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a microphone and supporting audio recording, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for audio processing type applications on the terminal devices 101, 102, 103. The background server may analyze and perform other processing on the received data such as audio, and feed back a processing result (e.g., synthesized speech) to the terminal device. As an example, the server 105 may be a cloud server.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should also be noted that the method for generating synthesized speech provided by the embodiments of the present disclosure is generally performed by a server. Accordingly, the various parts (e.g. units, sub-units, modules, sub-modules) comprised by the means for generating the synthesized speech are typically provided in the server.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. The system architecture may comprise only the electronic device (e.g. server or terminal device) on which the method for generating synthetic speech is running, when the electronic device on which the method for generating synthetic speech is running does not need to perform a data transfer with the other electronic device.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for generating synthesized speech according to the present disclosure is shown. The method for generating synthesized speech includes the steps of:
step 201, a first number of voiceprint features are obtained.
In this embodiment, an execution body of the method for generating synthesized speech (e.g., the server shown in fig. 1) may acquire the first number of voiceprint features from a local source or from another electronic device (e.g., the terminal device 101) or software module (e.g., a software module for storing voiceprint features) through a wired or wireless connection. The voiceprint features may be used to characterize the timbre of a speaker.
In some optional implementations of this embodiment, the executing body may obtain the first number of voiceprint features by:
in a first step, a first speech set is obtained.
In these implementations, the execution body may obtain the first voice set from a local source or from another electronic device (e.g., a database server) or software module (e.g., a software module for storing voice sets) through a wired or wireless connection. The first voice set comprises voices in the same language. As an example, the first voice set may include the voices of 200 people speaking an Indian dialect.
Optionally, the executing entity may obtain the first speech set by:
and S1, acquiring a second voice set.
In these implementations, the execution body may obtain the second speech set from a local or other electronic device (e.g., a database server) or a software module (e.g., a software module for storing the speech sets) through a wired connection or a wireless connection. The number of voices included in the second voice set is generally greater than the number of voices included in the first voice set.
And S2, inputting the voice in the second voice set into a pre-trained voice recognition model to obtain a recognition text corresponding to the input voice.
In these implementations, the executing entity may input the speech in the second speech set acquired in step S1 to a pre-trained speech recognition model, so as to obtain a recognized text corresponding to the input speech. The speech recognition model may include various models obtained through machine learning training and used for representing the correspondence between speech and recognized text.
And S3, selecting voice from the second voice set according to the recognition rate of the obtained recognition text to generate a first voice set.
In these implementations, the execution body may determine the recognition rate of the recognized text obtained in step S2. The recognition rate may include, but is not limited to, at least one of the following: WER (Word Error Rate) and SER (Sentence Error Rate). The recognition rate may also include a recognition accuracy determined from these errors. Then, the execution body may select voices from the second voice set acquired in step S1 in various ways to generate the first voice set. As an example, the execution body may select voices in order of recognition error rate from low to high, or of recognition accuracy from high to low. As another example, the execution body may select the voices whose recognition error rate is smaller than a preset error-rate threshold or whose recognition accuracy is greater than a preset accuracy threshold, as sketched below.
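The following is a minimal, non-limiting sketch of this selection step. It assumes that each candidate voice carries a reference transcript against which WER can be computed; the `asr_model` callable, the attribute names, and the 0.2 threshold are illustrative assumptions rather than interfaces named in this disclosure.

```python
import dataclasses

@dataclasses.dataclass
class Voice:
    audio: bytes        # raw or encoded audio of one utterance
    transcript: str     # reference text, assumed to be available

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def select_first_speech_set(second_speech_set, asr_model, max_wer=0.2):
    """Keep only the voices whose recognized text has a WER below max_wer."""
    first_speech_set = []
    for voice in second_speech_set:
        recognized = asr_model(voice.audio)  # pre-trained speech recognition model
        if word_error_rate(voice.transcript, recognized) < max_wer:
            first_speech_set.append(voice)
    return first_speech_set
```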
Based on this optional implementation, the execution body may select speech with higher intelligibility as training samples, thereby improving the quality of the training samples.
Optionally, the executing body may further generate the first speech set according to the following steps:
and S31, in response to the fact that the recognition rate of the obtained recognition text is larger than the preset threshold value, inputting the voice corresponding to the obtained recognition text into a pre-trained voice quality detection model, and obtaining a quality score corresponding to the input voice.
In these implementations, in response to determining that the recognition rate of the obtained recognized text is greater than a preset threshold (generally, that the recognition error rate is less than a preset error-rate threshold or the recognition accuracy is greater than a preset accuracy threshold), the execution body may input the speech corresponding to the obtained recognized text into a pre-trained voice quality detection model to obtain a quality score corresponding to the input speech. The voice quality detection model may include various models, obtained through machine learning training, that represent the correspondence between speech and quality scores. The voice quality detection model may be trained on sample voices differing in at least one of pronunciation fullness, speech continuity, signal-to-noise ratio, and pitch fluctuation, together with corresponding annotation information representing the quality scores of those samples.
And S32, selecting voice from the voice corresponding to the recognition text with the recognition rate larger than a preset threshold value according to the obtained quality score to generate a first voice set.
In these implementations, according to the quality scores obtained in step S31, the execution body may, in various ways, select voices from the voices corresponding to the recognized texts whose recognition rate is greater than the preset threshold (generally, whose recognition error rate is less than the preset error-rate threshold or whose recognition accuracy is greater than the preset accuracy threshold) to generate the first voice set. As an example, the execution body may first choose a candidate voice set based on the quality scores; the candidate voice set may include the voices whose quality scores exceed a preset quality threshold, or the top N voices ranked by quality score from high to low. Then, the execution body may select voices from the candidate voice set according to the recognition rate to generate the first voice set, in a manner consistent with the description of step S3. The quality-score stage is sketched below.
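The following is an illustrative sketch of the quality-score filtering stage only. The `quality_model` callable, the 0.8 threshold, and the optional top-N behaviour are assumptions introduced for illustration.

```python
def select_by_quality(candidate_voices, quality_model,
                      quality_threshold=0.8, top_n=None):
    """Score each candidate voice with a pre-trained voice quality detection
    model and keep either the voices above the threshold or, if top_n is
    given, the top-N voices ranked by quality score."""
    scored = [(quality_model(voice.audio), voice) for voice in candidate_voices]
    if top_n is not None:
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [voice for _, voice in scored[:top_n]]
    return [voice for score, voice in scored if score >= quality_threshold]
```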
Based on the optional implementation manner, the execution subject can select the voice with better tone quality as the training sample, so that the quality of the training sample is further improved.
And secondly, inputting the voice in the first voice set into a pre-trained voiceprint recognition model to obtain a voiceprint feature corresponding to the input voice.
In these implementations, the executing entity may input the speech in the first speech set acquired in the first step to a pre-trained voiceprint recognition model, so as to obtain a voiceprint feature corresponding to the input speech. The voiceprint recognition model may include various models obtained through machine learning training and used for representing the correspondence between the speech and the voiceprint features.
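As an illustrative sketch of this extraction step, each voice in the first voice set is mapped to an embedding by the pre-trained voiceprint recognition model; the `voiceprint_model` callable returning a fixed-length vector is an assumed interface, not one named in this disclosure.

```python
def extract_voiceprint_features(first_speech_set, voiceprint_model):
    """Return the first number of voiceprint features, one embedding per voice."""
    return [voiceprint_model(voice.audio) for voice in first_speech_set]
```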
Based on the optional implementation manner, the execution subject may extract a voiceprint feature corresponding to the real voice by using a pre-trained voiceprint recognition model.
A second number of voiceprint features are synthesized 202 using the first number of voiceprint features.
In this embodiment, the execution body may synthesize the second number of voiceprint features in various ways using the first number of voiceprint features obtained in step 201, where the second number is typically greater than the first number. As an example, the execution body may select two voiceprint features (for example, voiceprint feature A and voiceprint feature B) from the first number of voiceprint features and then, for each element position of the corresponding feature vectors, take the larger value, the smaller value, or any value between the two as the element of a newly generated voiceprint feature. For example, the embedding of voiceprint feature A may be (10, 11, 10) and the embedding of voiceprint feature B may be (9, 12, 8); the newly generated voiceprint features can then include (10, 12, 10), (9, 11, 8) and (9, 12, 9), among others. A minimal sketch of this element-wise synthesis is given below.
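The sketch below illustrates the element-wise variant described above; the function name and the random choice of values within each interval are assumptions made for illustration.

```python
import random

def synthesize_elementwise(feature_a, feature_b):
    """Generate a new voiceprint feature by taking, for every element position,
    a value lying between (and including) the corresponding elements of two
    real voiceprint features."""
    new_feature = []
    for a, b in zip(feature_a, feature_b):
        low, high = min(a, b), max(a, b)
        new_feature.append(random.uniform(low, high))
    return new_feature

# Example with the embeddings from the text:
feature_a = [10, 11, 10]
feature_b = [9, 12, 8]
print(synthesize_elementwise(feature_a, feature_b))  # e.g. [9.4, 11.7, 8.2]
```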
In some optional implementations of this embodiment, the execution body may select voiceprint features from the first number of voiceprint features and perform a proportional fusion operation to generate the second number of voiceprint features. As an example, the execution body may select two voiceprint features (e.g., voiceprint feature X and voiceprint feature Y) from the first number of voiceprint features and form a weighted combination (e.g., 70% of voiceprint feature X + 30% of voiceprint feature Y) to generate a new voiceprint feature. By varying the weights and the selected pairs, the execution body can then obtain a large number of new voiceprint features.
Based on this optional implementation, by sweeping the weight values the execution body can realize a nearly smooth transition from one real voiceprint feature to another, synthesizing a large number of new voiceprint features along the way and thereby providing a solid data basis for sample augmentation, as sketched below.
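The following is an illustrative sketch of the proportional fusion operation and of sweeping random weights over random pairs to reach a target count of synthesized features; the helper names and the uniform sampling strategy are assumptions.

```python
import random

def fuse_voiceprints(feature_x, feature_y, weight_x):
    """Proportional fusion of two voiceprint features:
    new = weight_x * X + (1 - weight_x) * Y, element by element."""
    return [weight_x * x + (1.0 - weight_x) * y
            for x, y in zip(feature_x, feature_y)]

def augment_voiceprints(real_features, target_count):
    """Synthesize target_count new features by repeatedly fusing random pairs
    of real voiceprint features with random weights, approximating a smooth
    transition between real voiceprints."""
    synthesized = []
    while len(synthesized) < target_count:
        x, y = random.sample(real_features, 2)
        synthesized.append(fuse_voiceprints(x, y, random.uniform(0.0, 1.0)))
    return synthesized

# e.g. 70% of voiceprint X plus 30% of voiceprint Y:
new_feature = fuse_voiceprints([10, 11, 10], [9, 12, 8], 0.7)
```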
Step 203, a third number of texts is obtained.
In this embodiment, the execution body may obtain the third number of texts from a local source or from another electronic device (for example, the terminal device 101) or software module (for example, a software module for storing texts) through a wired or wireless connection. The texts can be used to specify the content of the synthesized speech.
Step 204, generating a synthesized speech set according to the second number of voiceprint features and the third number of texts.
In this embodiment, according to the second number of voiceprint features synthesized in step 202 and the third number of texts obtained in step 203, the executing body may generate a synthesized speech set by using various speech synthesis (TTS) methods. The voices in the synthesized voice set may include voices having timbres indicated by the second number of voiceprint features and containing contents indicated by the third number of texts.
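As an illustrative sketch of step 204, every synthesized voiceprint feature can be paired with every text and passed to a speech synthesizer conditioned on the voiceprint. The `tts_model` interface (text plus voiceprint embedding in, waveform out) is an assumption made for illustration, not an API defined by this disclosure.

```python
def generate_synthesized_speech_set(voiceprint_features, texts, tts_model):
    """Generate one synthesized utterance per (voiceprint feature, text) pair."""
    synthesized_speech_set = []
    for feature in voiceprint_features:      # second number of voiceprint features
        for text in texts:                   # third number of texts
            waveform = tts_model(text=text, voiceprint=feature)
            synthesized_speech_set.append(
                {"text": text, "voiceprint": feature, "speech": waveform})
    return synthesized_speech_set
```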
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for generating synthesized speech according to the present embodiment. In the application scenario of fig. 3, a user 301 may click a "synthesize speech" button on a terminal 302. The terminal 302 may send a speech synthesis request 304 to the server 303. The server 303 may obtain 200 different voiceprint features 305 according to the indication of the speech synthesis request 304 or according to preset rules. The server 303 then synthesizes 1000 voiceprint features 306 in various ways using the 200 different voiceprint features 305. Next, the server 303 may obtain 100 different texts 307. Finally, the server 303 may generate a synthesized speech set 308 from the 1000 voiceprint features 306 and the 100 different texts 307. The timbres of the synthesized speech in the set may include the timbres indicated by the 1000 voiceprint features, and the contents may include the contents indicated by the 100 texts. Optionally, the server 303 may also feed the generated synthesized speech set back to the terminal 302.
The method provided by the above embodiment of the present disclosure synthesizes the second number of voiceprint features from the first number of voiceprint features and generates the synthesized speech set from them, thereby augmenting synthesized speech that covers more voiceprint features from the existing voiceprint features and providing a data basis for augmenting training samples for low-resource corpora.
With further reference to FIG. 4, a flow 400 of yet another embodiment of a method for training a speech synthesis model is illustrated. The process 400 of the method for training a speech synthesis model includes the steps of:
step 401, a training sample set is obtained.
In this embodiment, an executing entity (e.g., the server shown in fig. 1) of the method for training the speech synthesis model may obtain the training sample set from a local or other electronic device (e.g., the database server 101) or a software module (e.g., a software module for storing the training sample set) through a wired connection or a wireless connection. The training sample set may include a synthesized speech set obtained by the method for generating synthesized speech and a third number of texts corresponding to the synthesized speech set.
It should be noted that the descriptions of the synthesized speech set and the third number of texts may be consistent with the descriptions of the foregoing embodiments, and are not repeated here.
Step 402, an initial speech synthesis model is obtained.
In this embodiment, the execution subject may obtain the initial speech synthesis model from a local or other electronic device (e.g., the database server 101) or a software module (e.g., a software module for storing the initial speech synthesis model) through a wired connection or a wireless connection. The initial speech synthesis model may include various Neural network models that can be used for speech synthesis, such as RNN (Recurrent Neural Networks).
And step 403, taking the text of the training sample in the training sample set as input, taking the synthesized voice corresponding to the input text as expected output, and training to obtain a voice synthesis model.
In this embodiment, the execution body may use the texts of the training samples obtained in step 401 as inputs to the initial speech synthesis model obtained in step 402 and obtain the corresponding output results. The execution body may then compare each output result with the synthesized speech corresponding to the input text to generate a difference value, adjust the parameters of the initial speech synthesis model according to the difference value, and continue training with the adjusted model as the new initial speech synthesis model. When a training end condition is satisfied, the execution body may take the parameter-adjusted initial speech synthesis model obtained by training as the speech synthesis model. A minimal training-loop sketch is given below.
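The following PyTorch-style sketch illustrates this training loop. The model interface, the L1 loss, the optimizer settings, and the fixed-epoch stop condition are assumptions; the disclosure only specifies that the text of a training sample is the input and the corresponding synthesized speech is the expected output.

```python
import torch

def train_speech_synthesis_model(model, training_samples, num_epochs=10, lr=1e-3):
    """training_samples: iterable of (text_features, target_speech) tensor pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()                       # difference between outputs
    for _ in range(num_epochs):                       # stop condition: fixed epochs
        for text_features, target_speech in training_samples:
            predicted_speech = model(text_features)   # text of the sample as input
            loss = loss_fn(predicted_speech, target_speech)
            optimizer.zero_grad()
            loss.backward()                           # adjust parameters by the difference
            optimizer.step()
    return model
```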
In some optional implementations of this embodiment, the executing body may continue to perform the following steps:
firstly, obtaining target voiceprint characteristics.
In these implementations, the execution subject may obtain the target voiceprint characteristics from a local or other electronic device (e.g., the database server 101) or a software module (e.g., a software module for storing the target voiceprint characteristics) through a wired connection or a wireless connection. Wherein the target voiceprint feature can comprise any of the second number of voiceprint features.
And secondly, selecting a training sample corresponding to the target voiceprint feature from the training sample set to generate a target training sample set.
In these implementations, the executing entity may select a training sample corresponding to the target voiceprint feature obtained in the first step from the training sample set obtained in step 401 to generate a target training sample set.
And thirdly, determining the voice synthesis model as an initial target voiceprint voice synthesis model.
In these implementations, the executing entity may determine the speech synthesis model trained in step 403 as the initial target voiceprint speech synthesis model.
And fourthly, taking the text of the target training sample in the target training sample set as input, taking the synthesized voice corresponding to the input text as expected output, and training to obtain the target voiceprint voice synthesis model.
In these implementations, the executive may train the target voiceprint speech synthesis model in a manner consistent with step 403 described above.
Based on this optional implementation, the execution body may continue training with the target training samples on the basis of the speech synthesis model obtained above, so as to obtain a target voiceprint speech synthesis model capable of converting text into speech with the timbre indicated by the target voiceprint feature, as sketched below.
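The following sketch illustrates this fine-tuning stage by filtering the training set down to the samples generated with the target voiceprint feature and reusing the training-loop sketch above; the sample layout (text, speech, voiceprint triples) and helper names are assumptions.

```python
def finetune_for_target_voiceprint(speech_synthesis_model, training_samples,
                                   target_voiceprint, num_epochs=5):
    """Continue training the already-trained model on the target training samples."""
    target_samples = [(text, speech)
                      for text, speech, voiceprint in training_samples
                      if voiceprint == target_voiceprint]
    # The trained speech synthesis model serves as the initial target voiceprint model.
    return train_speech_synthesis_model(speech_synthesis_model,
                                        target_samples,
                                        num_epochs=num_epochs)
```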
It should be noted that, besides the above-mentioned contents, the embodiment of the present disclosure may also include the same or similar features and effects as the embodiment corresponding to fig. 2, and no further description is provided herein.
As can be seen from FIG. 4, the flow 400 of the method for training a speech synthesis model in this embodiment highlights the step of training the speech synthesis model with training samples comprising the aforementioned set of synthesized speech. The scheme described in this embodiment can therefore train the speech synthesis model with more sample data, which markedly improves the stability and accuracy of speech synthesis models for corpora whose real sample size is clearly insufficient (such as dialects of low-resource languages).
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for generating synthesized speech, which corresponds to the embodiment of the method shown in fig. 2, and which may include the same or corresponding features as the embodiment of the method shown in fig. 2 and produce the same or corresponding effects as the embodiment of the method shown in fig. 2, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 5, the apparatus 500 for generating synthesized speech of the present embodiment includes: a first acquisition unit 501, a synthesis unit 502, a second acquisition unit 503, and a generation unit 504. Wherein the first obtaining unit 501 is configured to obtain a first number of voiceprint features; a synthesizing unit 502 configured to synthesize a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; a second acquiring unit 503 configured to acquire a third number of texts; a generating unit 504 configured to generate a set of synthesized speech from the second number of voiceprint features and the third number of texts.
In the present embodiment, in the apparatus 500 for generating synthesized speech: the specific processing of the first obtaining unit 501, the combining unit 502, the second obtaining unit 503 and the generating unit 504 and the technical effects thereof can refer to the related descriptions of step 201, step 202, step 203 and step 204 in the embodiment corresponding to fig. 2, and are not described herein again.
In some optional implementations of this embodiment, the synthesizing unit 502 may be further configured to select a voiceprint feature from the first number of voiceprint features to perform a proportional fusion operation, so as to generate a second number of voiceprint features.
In some optional implementation manners of this embodiment, the first obtaining unit 501 may include: a first acquisition subunit (not shown), a generation subunit (not shown). Wherein the first obtaining subunit may be configured to obtain the first speech set. The first speech set may include speeches of the same language. The generating subunit is configured to input the voices in the first voice set to a pre-trained voiceprint recognition model, and obtain a voiceprint feature corresponding to the input voices.
In some optional implementation manners of this embodiment, the first obtaining subunit may include: an acquisition module (not shown), a generation module (not shown), and a selection module (not shown). Wherein the obtaining module may be configured to obtain the second set of voices. The number of voices which can be included in the second voice set is larger than that of voices which are included in the first voice set. The generating module is configured to input the voices in the second voice set into a pre-trained voice recognition model, and obtain a recognition text corresponding to the input voices. The selecting module is configured to select a voice from the second voice set to generate a first voice set according to the recognition rate of the obtained recognition text.
In some optional implementation manners of this embodiment, the selecting module may include: generating sub-modules (not shown in the figure) and selecting sub-modules (not shown in the figure). The generation sub-module may be configured to, in response to determining that the recognition rate of the obtained recognition text is greater than a preset threshold, input the speech corresponding to the obtained recognition text to a pre-trained speech quality detection model, and obtain a quality score corresponding to the input speech. The selecting submodule may be configured to select, according to the obtained quality score, a voice from voices corresponding to the recognized text with the recognition rate greater than a preset threshold value to generate the first voice set.
The apparatus provided by the above embodiment of the present disclosure acquires a first number of voiceprint features through the first acquiring unit 501. Then, the synthesizing unit 502 synthesizes a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number. After that, the second acquiring unit 503 acquires a third number of texts. Finally, the generating unit 504 generates a set of synthesized speech from the second number of voiceprint features and the third number of texts. The apparatus thus augments, from the existing voiceprint features, synthesized speech covering more voiceprint features, providing a data basis for augmenting training samples for low-resource corpora and helping to improve the accuracy of a speech synthesis model.
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for training a speech synthesis model, the apparatus embodiment corresponds to the method embodiment shown in fig. 4, and the apparatus embodiment may further include the same or corresponding features as the method embodiment shown in fig. 4 and produce the same or corresponding effects as the method embodiment shown in fig. 4, in addition to the features described below. The device can be applied to various electronic equipment.
As shown in fig. 6, the apparatus 600 for training a speech synthesis model of the present embodiment includes: a third acquisition unit 601, a fourth acquisition unit 602, and a training unit 603. The third obtaining unit 601 is configured to obtain a training sample set, where the training sample set includes a synthesized speech set generated by the method for synthesizing speech and a third number of texts corresponding to the synthesized speech set; a fourth obtaining unit 602 configured to obtain an initial speech synthesis model; a training unit 603 configured to train a speech synthesis model by using the text of the training sample in the training sample set as an input and using the synthesized speech corresponding to the input text as an expected output.
In the present embodiment, in the apparatus 600 for training a speech synthesis model: for specific processing of the third obtaining unit 601, the fourth obtaining unit 602, and the training unit 603 and technical effects brought by the processing, reference may be made to the related descriptions of step 401, step 402, and step 403 in the embodiment corresponding to fig. 4, and no further description is given here.
In some optional implementations of this embodiment, the apparatus 600 for training a speech synthesis model described above may further include: a fifth acquiring unit (not shown), a selecting unit (not shown), a determining unit (not shown), and a training unit (not shown). Wherein the fifth acquiring unit may be configured to acquire the target voiceprint feature. The selecting unit may be configured to select a training sample corresponding to the target voiceprint feature from the training sample set to generate a target training sample set. The above-mentioned determining unit may be configured to determine the speech synthesis model as an initial target voiceprint speech synthesis model. The continuous training unit may be configured to train the text of the target training sample in the target training sample set as an input, and the synthesized speech corresponding to the input text as an expected output to obtain the target voiceprint speech synthesis model.
The apparatus provided by the above embodiment of the present disclosure acquires the training sample set through the third acquiring unit 601, wherein the training sample set comprises a synthesized speech set generated by the method for generating synthesized speech and a third number of texts corresponding to the synthesized speech set. Then, the fourth acquiring unit 602 acquires an initial speech synthesis model. Finally, the training unit 603 takes the texts of the training samples in the training sample set as input and the synthesized speech corresponding to the input texts as expected output, and trains the speech synthesis model. The apparatus thus trains the speech synthesis model with more sample data, which markedly improves the accuracy of speech synthesis models for corpora whose real sample size is clearly insufficient (such as dialects of low-resource languages).
Referring now to FIG. 7, a block diagram of an electronic device (e.g., the server of FIG. 1) 700 suitable for implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle terminals (e.g., car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The server shown in fig. 7 is only an example and should not impose any limitation on the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. Various programs and data necessary for the operation of the electronic device 700 are also stored in the RAM 703. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In accordance with one or more embodiments of the present disclosure, there is provided a method for generating synthesized speech, the method comprising: obtaining a first number of voiceprint features; synthesizing a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; acquiring a third number of texts; a set of synthesized speech is generated based on the second number of voiceprint features and the third number of texts.
According to one or more embodiments of the present disclosure, in the method for generating synthesized speech provided by the present disclosure, synthesizing the second number of voiceprint features using the first number of voiceprint features comprises: selecting voiceprint features from the first number of voiceprint features and performing a proportional fusion operation on them to generate the second number of voiceprint features.
According to one or more embodiments of the present disclosure, in the method for generating synthesized speech provided by the present disclosure, obtaining the first number of voiceprint features comprises: acquiring a first voice set, wherein the first voice set comprises voices in the same language; and inputting the voices in the first voice set into a pre-trained voiceprint recognition model to obtain voiceprint features corresponding to the input voices.
According to one or more embodiments of the present disclosure, in the method for generating synthesized speech provided by the present disclosure, the obtaining the first speech set includes: acquiring a second voice set, wherein the number of voices in the second voice set is greater than the number of voices in the first voice set; inputting the voice in the second voice set to a pre-trained voice recognition model to obtain a recognition text corresponding to the input voice; and selecting voice from the second voice set to generate a first voice set according to the obtained recognition rate of the recognized text.
According to one or more embodiments of the present disclosure, in the method for generating synthesized speech provided by the present disclosure, selecting speech from the second speech set according to the recognition rate of the obtained recognized text to generate the first speech set includes: in response to determining that the recognition rate of the obtained recognized text is greater than a preset threshold, inputting the speech corresponding to the obtained recognized text into a pre-trained voice quality detection model to obtain a quality score corresponding to the input speech; and selecting, according to the obtained quality score, speech from the speech corresponding to the recognized texts whose recognition rate is greater than the preset threshold to generate the first speech set.
In accordance with one or more embodiments of the present disclosure, there is provided a method for training a speech synthesis model, the method comprising: obtaining a training sample set, wherein the training sample set comprises a synthetic speech set generated by a method for synthesizing speech and a third number of texts corresponding to the synthetic speech set; acquiring an initial speech synthesis model; and taking the text of the training sample in the training sample set as input, taking the synthesized voice corresponding to the input text as expected output, and training to obtain a voice synthesis model.
In accordance with one or more embodiments of the present disclosure, the method for training a speech synthesis model provided by the present disclosure further comprises: acquiring a target voiceprint feature; selecting training samples corresponding to the target voiceprint feature from the training sample set to generate a target training sample set; determining the speech synthesis model as an initial target voiceprint speech synthesis model; and taking the texts of the target training samples in the target training sample set as input and the synthesized speech corresponding to the input texts as expected output, training to obtain the target voiceprint speech synthesis model.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for generating synthesized speech, the apparatus including: a first acquisition unit configured to acquire a first number of voiceprint features; a synthesizing unit configured to synthesize a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; a second acquisition unit configured to acquire a third number of texts; a generating unit configured to generate a set of synthesized speech from the second number of voiceprint features and the third number of texts.
In an apparatus for generating synthesized speech according to one or more embodiments of the present disclosure, the synthesis unit may be further configured to select a voiceprint feature from the first number of voiceprint features to perform a proportional fusion operation, and generate a second number of voiceprint features.
According to one or more embodiments of the present disclosure, in the apparatus for generating synthesized speech provided by the present disclosure, the first acquisition unit includes: a first acquisition subunit configured to acquire a first speech set, wherein the first speech set may include speech in the same language; and a generating subunit configured to input the speech in the first speech set into a pre-trained voiceprint recognition model to obtain voiceprint features corresponding to the input speech.
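As a minimal sketch of this extraction step, assuming a hypothetical `voiceprint_model` object whose `embed` method maps an utterance to a fixed-length embedding:

```python
def extract_voiceprint_features(first_speech_set, voiceprint_model):
    """Feed each speech sample to a pre-trained voiceprint recognition model and
    keep the returned embedding as that sample's voiceprint feature.
    `voiceprint_model.embed` is a hypothetical interface, not a specific library call."""
    return [voiceprint_model.embed(speech) for speech in first_speech_set]
```

In practice the embedding could come from any speaker-verification model that maps an utterance to a fixed-length vector.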
According to one or more embodiments of the present disclosure, in the apparatus for generating synthesized speech provided by the present disclosure, the first acquisition subunit includes: an acquisition module configured to acquire a second speech set, wherein the second speech set may include more speech samples than the first speech set; a generating module configured to input the speech in the second speech set into a pre-trained speech recognition model to obtain recognized text corresponding to the input speech; and a selecting module configured to select speech from the second speech set, according to the recognition rate of the obtained recognized text, to generate the first speech set.
According to one or more embodiments of the present disclosure, in the apparatus for generating synthesized speech provided by the present disclosure, the selecting module includes: a generating submodule configured to, in response to determining that the recognition rate of the obtained recognized text is greater than a preset threshold, input the speech corresponding to the obtained recognized text into a pre-trained speech quality detection model to obtain a quality score corresponding to the input speech; and a selecting submodule configured to select speech, according to the obtained quality scores, from the speech corresponding to the recognized text whose recognition rate is greater than the preset threshold, so as to generate the first speech set.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for training a speech synthesis model, the apparatus comprising: a third obtaining unit configured to obtain a training sample set, wherein the training sample set includes a synthesized speech set generated by the above method for generating synthesized speech and a third number of texts corresponding to the synthesized speech set; a fourth obtaining unit configured to obtain an initial speech synthesis model; and a training unit configured to take the text of the training samples in the training sample set as input and the synthesized speech corresponding to the input text as expected output, and to train to obtain the speech synthesis model.
In accordance with one or more embodiments of the present disclosure, the apparatus for training a speech synthesis model provided by the present disclosure further includes: a fifth acquiring unit configured to acquire a target voiceprint feature; a selecting unit configured to select, from the training sample set, training samples corresponding to the target voiceprint feature to generate a target training sample set; a determining unit configured to determine the speech synthesis model as an initial target voiceprint speech synthesis model; and a continued training unit configured to take the text of the target training samples in the target training sample set as input and the synthesized speech corresponding to the input text as expected output, and to train to obtain a target voiceprint speech synthesis model.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may be described as: a processor including a first acquisition unit, a synthesis unit, a second acquisition unit, and a generation unit. The names of these units do not in some cases limit the units themselves; for example, the first acquisition unit may also be described as "a unit that acquires a first number of voiceprint features".
As another aspect, embodiments of the present disclosure also provide a computer-readable medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain a first number of voiceprint features; synthesize a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number; acquire a third number of texts; and generate a synthesized speech set according to the second number of voiceprint features and the third number of texts.
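Putting the four operations together, the program carried by such a medium could be sketched roughly as below; `extract_voiceprint_features` and `synthesize_voiceprints` refer to the earlier sketches, and `tts_engine.synthesize` is a hypothetical text-to-speech interface, so none of these names come from the disclosure itself.

```python
# End-to-end sketch of the generation pipeline carried by the program:
# 1) obtain a first number of voiceprint features,
# 2) fuse them into a larger, second number of voiceprint features,
# 3) obtain a third number of texts,
# 4) synthesize speech for every (voiceprint, text) pair.
def generate_synthesized_speech_set(first_speech_set, voiceprint_model, tts_engine,
                                    texts, second_number):
    first_features = extract_voiceprint_features(first_speech_set, voiceprint_model)
    fused_features = synthesize_voiceprints(first_features, second_number)

    synthesized_set = []
    for voiceprint in fused_features:
        for text in texts:                       # the third number of texts
            audio = tts_engine.synthesize(text, voiceprint)
            synthesized_set.append((text, voiceprint, audio))
    return synthesized_set
```

The sketch pairs every fused voiceprint with every text; the disclosure does not fix how voiceprints and texts are paired, so this exhaustive pairing is just one simple choice.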
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure involved in these embodiments is not limited to technical solutions formed by the specific combination of the above technical features, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (11)

1. A method for generating synthesized speech, comprising:
obtaining a first number of voiceprint features;
synthesizing a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number;
acquiring a third number of texts;
and generating a synthetic speech set according to the second number of voiceprint features and the third number of texts.
2. The method of claim 1, wherein said synthesizing a second number of voiceprint features using said first number of voiceprint features comprises:
and selecting voiceprint features from the first number of voiceprint features to perform a proportional fusion operation, so as to generate the second number of voiceprint features.
3. The method of claim 1 or 2, wherein said obtaining a first number of voiceprint features comprises:
acquiring a first voice set, wherein the voices in the first voice set are in the same language;
and inputting the voice in the first voice set to a pre-trained voiceprint recognition model to obtain a voiceprint feature corresponding to the input voice.
4. The method of claim 3, wherein the acquiring a first voice set comprises:
acquiring a second voice set, wherein the number of voices in the second voice set is greater than the number of voices in the first voice set;
inputting the voice in the second voice set to a pre-trained voice recognition model to obtain a recognition text corresponding to the input voice;
and selecting voice from the second voice set, according to the recognition rate of the obtained recognized text, to generate the first voice set.
5. The method of claim 4, wherein the selecting voice from the second voice set, according to the recognition rate of the obtained recognized text, to generate the first voice set comprises:
in response to determining that the recognition rate of the obtained recognized text is greater than a preset threshold, inputting the voice corresponding to the obtained recognized text into a pre-trained voice quality detection model to obtain a quality score corresponding to the input voice;
and selecting, according to the obtained quality scores, voice from the voice corresponding to the recognized text whose recognition rate is greater than the preset threshold, to generate the first voice set.
6. A method for training a speech synthesis model, comprising:
obtaining a set of training samples, wherein the set of training samples comprises a set of synthesized speech generated by the method according to one of claims 1-5 and a third number of texts corresponding to the set of synthesized speech;
acquiring an initial speech synthesis model;
and taking the text of the training samples in the training sample set as input and the synthesized speech corresponding to the input text as expected output, and training to obtain a speech synthesis model.
7. The method of claim 6, wherein the method further comprises:
acquiring a target voiceprint feature;
selecting a training sample corresponding to the target voiceprint feature from the training sample set to generate a target training sample set;
determining the speech synthesis model as an initial target voiceprint speech synthesis model;
and taking the text of the target training samples in the target training sample set as input and the synthesized speech corresponding to the input text as expected output, and training to obtain a target voiceprint speech synthesis model.
8. An apparatus for generating synthesized speech, comprising:
a first acquisition unit configured to acquire a first number of voiceprint features;
a synthesizing unit configured to synthesize a second number of voiceprint features using the first number of voiceprint features, wherein the second number is greater than the first number;
a second acquisition unit configured to acquire a third number of texts;
a generating unit configured to generate a set of synthesized speech from the second number of voiceprint features and the third number of texts.
9. An apparatus for training a speech synthesis model, comprising:
a third obtaining unit configured to obtain a training sample set, wherein the training sample set comprises a set of synthesized speech generated by the method according to one of claims 1 to 5 and a third number of texts corresponding to the set of synthesized speech;
a fourth obtaining unit configured to obtain an initial speech synthesis model;
and a training unit configured to take the text of the training samples in the training sample set as input and the synthesized speech corresponding to the input text as expected output, and to train to obtain the speech synthesis model.
10. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method according to any one of claims 1-7.
11. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-7.
CN202010698372.1A 2020-07-20 2020-07-20 Method, apparatus, device and medium for generating synthesized speech Pending CN111862933A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010698372.1A CN111862933A (en) 2020-07-20 2020-07-20 Method, apparatus, device and medium for generating synthesized speech

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010698372.1A CN111862933A (en) 2020-07-20 2020-07-20 Method, apparatus, device and medium for generating synthesized speech

Publications (1)

Publication Number Publication Date
CN111862933A 2020-10-30

Family

ID=73001122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010698372.1A Pending CN111862933A (en) 2020-07-20 2020-07-20 Method, apparatus, device and medium for generating synthesized speech

Country Status (1)

Country Link
CN (1) CN111862933A (en)

Patent Citations (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09101796A (en) * 1995-10-04 1997-04-15 Sharp Corp Indicating device of electronic equipment
JP2002244690A (en) * 2001-02-16 2002-08-30 Chikako Hayashi System making it possible to converse with dead person by analyzing and recording wavelength, voiceprint, etc., of the person before death
CN2763935Y (en) * 2003-12-12 2006-03-08 北京大学 Spenker certification identifying system by combined lexeme and sound groove information
CN101178895A (en) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 Model self-adapting method based on generating parameter listen-feel error minimize
CN104064189A (en) * 2014-06-26 2014-09-24 厦门天聪智能软件有限公司 Vocal print dynamic password modeling and verification method
WO2016054991A1 (en) * 2014-10-10 2016-04-14 阿里巴巴集团控股有限公司 Voiceprint information management method and device as well as identity authentication method and system
US20160316366A1 (en) * 2015-04-23 2016-10-27 Kyocera Corporation Electronic device and voiceprint authentication method
CN105244031A (en) * 2015-10-26 2016-01-13 北京锐安科技有限公司 Speaker identification method and device
CN106714226A (en) * 2015-11-13 2017-05-24 中国移动通信集团公司 Voice quality evaluation method, device and system
WO2017162053A1 (en) * 2016-03-21 2017-09-28 中兴通讯股份有限公司 Identity authentication method and device
WO2017197953A1 (en) * 2016-05-16 2017-11-23 腾讯科技(深圳)有限公司 Voiceprint-based identity recognition method and device
CN106100846A (en) * 2016-06-02 2016-11-09 百度在线网络技术(北京)有限公司 Voiceprint registration, authentication method and device
WO2018223727A1 (en) * 2017-06-09 2018-12-13 平安科技(深圳)有限公司 Voiceprint recognition method, apparatus and device, and medium
CN107993071A (en) * 2017-11-21 2018-05-04 平安科技(深圳)有限公司 Electronic device, auth method and storage medium based on vocal print
CN108364655A (en) * 2018-01-31 2018-08-03 网易乐得科技有限公司 Method of speech processing, medium, device and computing device
JP2019211747A (en) * 2018-05-31 2019-12-12 バイドゥ オンライン ネットワーク テクノロジー (ベイジン) カンパニー リミテッド Voice concatenative synthesis processing method and apparatus, computer equipment and readable medium
CN109473105A (en) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 The voice print verification method, apparatus unrelated with text and computer equipment
CN109714608A (en) * 2018-12-18 2019-05-03 深圳壹账通智能科技有限公司 Video data handling procedure, device, computer equipment and storage medium
CN109637525A (en) * 2019-01-25 2019-04-16 百度在线网络技术(北京)有限公司 Method and apparatus for generating vehicle-mounted acoustic model
CN109830246A (en) * 2019-01-25 2019-05-31 北京海天瑞声科技股份有限公司 Audio quality appraisal procedure, device, electronic equipment and storage medium
CN109801628A (en) * 2019-02-11 2019-05-24 龙马智芯(珠海横琴)科技有限公司 A kind of corpus collection method, apparatus and system
CN110263322A (en) * 2019-05-06 2019-09-20 平安科技(深圳)有限公司 Audio for speech recognition corpus screening technique, device and computer equipment
CN110400567A (en) * 2019-07-30 2019-11-01 深圳秋田微电子股份有限公司 Register vocal print dynamic updating method and computer storage medium
CN110675862A (en) * 2019-09-25 2020-01-10 招商局金融科技有限公司 Corpus acquisition method, electronic device and storage medium
CN111081259A (en) * 2019-12-18 2020-04-28 苏州思必驰信息科技有限公司 Speech recognition model training method and system based on speaker expansion
CN111128119A (en) * 2019-12-31 2020-05-08 云知声智能科技股份有限公司 Voice synthesis method and device
CN111341323A (en) * 2020-02-10 2020-06-26 厦门快商通科技股份有限公司 Voiceprint recognition training data amplification method and system, mobile terminal and storage medium
CN111048064A (en) * 2020-03-13 2020-04-21 同盾控股有限公司 Voice cloning method and device based on single speaker voice synthesis data set

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘镇; 吕超; 范远超: "Parallelized voiceprint identification method for multiple sound sources based on deep learning", Journal of Jiangsu University of Science and Technology (Natural Science Edition), no. 01 *
周雁; 西绕多吉: "Corpus construction for Tibetan voiceprint recognition", Computer Engineering & Science, no. 11 *
徐宝龙; 努尔麦麦提・尤鲁瓦斯; 吾守尔・斯拉木: "Research on triphone selection methods for spoken Uyghur corpora", Journal of Chinese Information Processing, no. 02 *
郑凯鹏; 周萍; 张上鑫; 柯晶晶: "Fusion parameters based on cepstral components applied to voiceprint recognition", Microelectronics & Computer, no. 08 *
郑纯军; 王春立; 贾宁: "A survey of acoustic feature extraction for speech tasks", Computer Science, no. 05 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112634859A (en) * 2020-12-28 2021-04-09 苏州思必驰信息科技有限公司 Data enhancement method and system for text-related speaker recognition
CN113421554A (en) * 2021-07-05 2021-09-21 平安科技(深圳)有限公司 Voice keyword detection model processing method and device and computer equipment
CN113421554B (en) * 2021-07-05 2024-01-16 平安科技(深圳)有限公司 Voice keyword detection model processing method and device and computer equipment

Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
CN111599343B (en) Method, apparatus, device and medium for generating audio
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111402843B (en) Rap music generation method and device, readable medium and electronic equipment
CN111402842B (en) Method, apparatus, device and medium for generating audio
CN111899720B (en) Method, apparatus, device and medium for generating audio
CN111899719A (en) Method, apparatus, device and medium for generating audio
US20240021202A1 (en) Method and apparatus for recognizing voice, electronic device and medium
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
CN111369971A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
JP2020174339A (en) Method, device, server, computer-readable storage media, and computer program for aligning paragraph and image
CN111782576B (en) Background music generation method and device, readable medium and electronic equipment
CN110534085B (en) Method and apparatus for generating information
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN112509562A (en) Method, apparatus, electronic device and medium for text post-processing
CN112786008A (en) Speech synthesis method, device, readable medium and electronic equipment
CN111883107B (en) Speech synthesis and feature extraction model training method, device, medium and equipment
CN111785247A (en) Voice generation method, device, equipment and computer readable medium
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
CN112927674A (en) Voice style migration method and device, readable medium and electronic equipment
CN111883139A (en) Method, apparatus, device and medium for screening target voices
CN113257218A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination