CN111369968B - Speech synthesis method and device, readable medium and electronic equipment - Google Patents

Info

Publication number: CN111369968B (application CN202010197182.1A)
Authority: CN (China)
Prior art keywords: processed, sound, target, spectrum data, preset
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111369968A (en)
Inventor: 殷翔 (Yin Xiang)
Current assignee: Beijing ByteDance Network Technology Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Beijing ByteDance Network Technology Co Ltd
Priority and filing date: 2020-03-19 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Application filed by Beijing ByteDance Network Technology Co Ltd; published as CN111369968A, then granted and published as CN111369968B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The disclosure relates to a speech synthesis method and apparatus, a readable medium, and an electronic device. The method comprises: acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user and the sound fragment contains noise; extracting to-be-processed spectrum data from the sound to be processed; generating, according to the spectrum data to be processed and the target text, target spectrum data corresponding to the timbre of the sound to be processed and the target text; and synthesizing, according to the target spectrum data, a target sound corresponding to the target text. Clear speech can thus be synthesized from a noise-containing sound fragment of any length input by the user; that is, the effect of speech synthesis in multiple SNR (signal-to-noise ratio) environments is improved. The user does not need to record speech with restricted content, for a long time, or in a noise-free environment, so the complexity of speech synthesis for the user is greatly reduced while the speech synthesis effect is guaranteed.

Description

Speech synthesis method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech synthesis technology, and in particular, to a speech synthesis method, apparatus, readable medium, and electronic device.
Background
In the prior art, in order to capture the timbre of a speaker and automatically generate arbitrary speech in that timbre, the speaker is usually required to record in a quiet place. If the speaker records in a relatively noisy environment and the input sound therefore carries substantial noise, it is difficult for the speech synthesis to achieve an ideal effect.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a method of speech synthesis, the method comprising:
acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise;
extracting to-be-processed spectrum data from the to-be-processed sound;
generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
and synthesizing, according to the target spectrum data, a target sound corresponding to the target text.
In a second aspect, the present disclosure also provides a speech synthesis apparatus, the apparatus comprising:
an acquisition module, used for acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise;
an extraction module, used for extracting to-be-processed spectrum data from the sound to be processed;
a processing module, used for generating target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
and a synthesis module, used for synthesizing, according to the target spectrum data, a target sound corresponding to the target text.
In a third aspect, the present disclosure also provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of the first aspect.
In a fourth aspect, the present disclosure also provides an electronic device, including:
a storage device having one or more computer programs stored thereon;
one or more processing means for executing the one or more computer programs in the storage means to effect the steps of the method of the first aspect.
According to the above technical solution, the user's voice can be processed from a sound fragment of any length input by the user, and clear speech can be synthesized even when the sound to be processed input by the user contains noise; that is, the effect of speech synthesis in multiple SNR (signal-to-noise ratio) environments is improved, and the target text is voiced, and can thus be read aloud, in the user's own timbre. The user does not need to record speech with restricted content, for a long time, or in a noise-free environment, so the complexity of speech synthesis for the user is greatly reduced while the speech synthesis effect is guaranteed.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart illustrating a speech synthesis method according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a speech synthesis method according to still another exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a speech synthesis method according to still another exemplary embodiment of the present disclosure.
Fig. 4 is a block diagram illustrating a structure of a voice synthesizing apparatus according to an exemplary embodiment of the present disclosure.
Fig. 5 shows a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of protection of the present disclosure.
It will be appreciated that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly remind the user that the operation requested will require the acquisition and use of the user's personal information. The user can thus autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the technical solution of the present disclosure.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user by way of, for example, a popup window, in which the prompt information may be presented as text. In addition, the popup window may carry a selection control for the user to choose "agree" or "disagree" to provide personal information to the electronic device.
It will be appreciated that the above-described notification and user authorization process is merely illustrative and not limiting of the implementations of the present disclosure, and that other ways of satisfying relevant legal regulations may be applied to the implementations of the present disclosure.
It will be appreciated that the data (including but not limited to the data itself, the acquisition or use of the data) involved in the present technical solution should comply with the corresponding legal regulations and the requirements of the relevant regulations.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Fig. 1 is a flowchart illustrating a speech synthesis method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the method includes steps 101 to 104.
In step 101, a sound to be processed and a target text input by a user are acquired, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise. The sound to be processed input by the user can be a sound fragment of any length, its spoken content can be any text, and it can be recorded in an arbitrarily noisy environment.
The sound to be processed can be input by the user in any manner; for example, the user can record it directly on site, or upload an existing sound segment. Although no particular length or content is required of the sound to be processed, a recommended upper and/or lower limit on the input duration, for example 10 to 30 minutes, can be given according to the actual situation, so as to ensure the effect of speech synthesis.
The target text may be determined by the user's selection from existing text templates, or, if the user makes no selection, determined automatically from a default text template. That is, where multiple text templates are available for generating speech, the user may, after inputting the sound to be processed, select the text template to be voiced; if the user does not select, a default text template may be used directly, or, where a random-selection function is supported, one text template randomly chosen from all existing templates may serve as the target text. In addition, the user may directly input any text as the target text.
In step 102, spectral data to be processed is extracted from the sound to be processed. Any extraction method may be used, as long as it can extract the spectral data of the sound to be processed. The spectral data to be processed may be, for example, a mel spectrogram (mel bank features).
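As a concrete illustration, a minimal sketch of this extraction step is given below using librosa; the sampling rate and the n_fft, hop_length, and n_mels values are illustrative assumptions, not parameters fixed by this disclosure.

```python
import librosa
import numpy as np

def extract_mel_spectrogram(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Load a sound clip and return its log-mel spectrogram,
    shape [n_mels, n_frames]."""
    waveform, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )
    # Log compression is commonly applied before feeding an acoustic model.
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))
```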
In step 103, target spectrum data corresponding to the timbre of the sound to be processed and the target text is generated according to the spectrum data to be processed and the target text.
After the to-be-processed spectrum data of the sound to be processed is obtained, timbre processing can be performed on the sound to be processed according to the determined target text, and the target spectrum data can be generated according to the target text.
In a possible implementation manner, the target spectrum data may be generated in step 103 by determining, through a preset neural network acoustic model and according to the to-be-processed spectrum data and the target text, target spectrum data corresponding to the target text and to the timbre of the sound to be processed. The preset neural network acoustic model is obtained through training in advance and can convert the to-be-processed spectrum data of the user's voice into target spectrum data according to the target text, thereby achieving the effect of voicing the target text with the user's voice. The training data for training the preset neural network acoustic model may be a plurality of sound segments from a plurality of speakers together with the text information corresponding to those sound segments.
In a possible implementation manner, the determining, by the preset neural network acoustic model, the target spectrum data corresponding to the target text according to the to-be-processed spectrum data and the target text may be implemented in the following manner: training the preset neural network acoustic model according to the to-be-processed spectral data and the target text to obtain a target neural network acoustic model corresponding to the to-be-processed spectral data; and taking the target text as input of the target neural network acoustic model to obtain the target spectrum data.
That is, starting from the preset neural network acoustic model pre-trained on sound segments of multiple speakers, the target text and the to-be-processed spectral data extracted from the sound input by the user are used as the input and output, respectively, to further train the model, yielding a target neural network acoustic model that has learned the acoustic characteristics of the user's voice. The target spectrum data is then generated by this target neural network acoustic model. In this way, the preset neural network acoustic model is adaptively fine-tuned in real time, so that the target neural network acoustic model achieves a better processing effect on each individual user's voice.
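A hedged sketch of this adaptive fine-tuning step follows, assuming a PyTorch model whose forward pass maps text to a mel spectrogram; the model interface, loss choice, and hyperparameters are illustrative assumptions rather than details fixed by this disclosure.

```python
import copy
import torch

def adapt_acoustic_model(pretrained_model, text_ids, mel_to_process,
                         steps: int = 100, lr: float = 1e-4):
    """Fine-tune a copy of the pre-trained acoustic model on the user's
    own (target text, to-be-processed spectrum) pair."""
    model = copy.deepcopy(pretrained_model)      # keep the preset model intact
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.L1Loss()                # a common spectrogram regression loss
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        predicted_mel = model(text_ids)          # target text as input
        loss = criterion(predicted_mel, mel_to_process)  # user's spectrum as output target
        loss.backward()
        optimizer.step()
    return model  # the target neural network acoustic model

# Afterwards, the target spectrum is obtained with:
# target_mel = adapted_model(text_ids)
```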
Specifically, in one possible implementation manner, the preset neural network acoustic model may include a preset speaker verification network sub-model and a preset conversion sub-model, and the training the preset neural network acoustic model according to the to-be-processed spectral data and the target text to obtain the target neural network acoustic model corresponding to the to-be-processed spectral data may be implemented by the following manner: taking the to-be-processed spectral data as the input of the preset speaker verification network submodel to extract a target speaker characterization vector corresponding to the to-be-processed spectral data; and taking the target speaker characterization vector and the target text as the input of the preset conversion sub-model, and taking the to-be-processed spectrum data as the output of the conversion sub-model for training so as to obtain the target neural network acoustic model.
The target speaker characterization vector may be, for example, a speaker embedding. The target speaker characterization vector is taken from the output of one of the internal layers of the preset speaker verification network sub-model, not from its final output; for example, it may be the speaker embedding produced by the dense layer immediately before the softmax layer of the preset speaker verification network sub-model.
After the target speaker characterization vector (speaker embedding) is obtained from the to-be-processed spectral data through the preset speaker verification network sub-model, the speaker embedding and the target text are input into the preset conversion sub-model, and the to-be-processed spectral data is used as the output of the preset conversion sub-model for further training, so that the target neural network acoustic model including the trained preset conversion sub-model can be obtained.
When the preset conversion sub-model is further trained, its weights can be initialized from the preset conversion sub-model obtained through pre-training.
In the pre-training of the preset neural network acoustic model, the training data consisting of sound segments of multiple speakers is handled in the same way. For example, for any piece of training data, the spectral data extracted from its sound segment is input into the preset speaker verification network sub-model, and the speaker embedding output by one of the layers of that model is taken out; this speaker embedding and the text information corresponding to the sound segment are used as the input of the preset conversion sub-model, the spectral data extracted from the sound segment is used as its output, and the preset conversion sub-model is trained. This procedure is repeated over all the training data.
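The sketch below, assuming PyTorch, illustrates how a speaker embedding can be taken from the dense layer just before the softmax layer, as described above; the network layers and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerVerificationNet(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256,
                 n_speakers: int = 1000):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, 256, batch_first=True)
        self.dense = nn.Linear(256, embed_dim)              # layer before softmax
        self.classifier = nn.Linear(embed_dim, n_speakers)  # softmax layer

    def forward(self, mel: torch.Tensor):
        """mel: [batch, n_frames, n_mels]."""
        _, (hidden, _) = self.encoder(mel)
        embedding = self.dense(hidden[-1])   # speaker embedding taken out here
        logits = self.classifier(embedding)  # used only during verification training
        return logits, embedding

# At synthesis time only the embedding is needed:
# _, speaker_embedding = verification_net(mel_to_process)
```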
In step 104, a target sound corresponding to the target text is synthesized according to the target spectrum data.
The target sound may be, for example, wav waveform data, and the target sound may be synthesized from the target spectrum data by passing it through a preset neural network vocoder, which may be, for example, a WaveNet vocoder.
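This disclosure names a WaveNet vocoder for this step; as a self-contained stand-in, the sketch below instead inverts the mel spectrogram with librosa's Griffin-Lim-based inversion, using the same illustrative parameters as the extraction sketch above.

```python
import librosa
import numpy as np

def synthesize_waveform(log_mel: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Invert a log-mel spectrogram back to a waveform."""
    mel = np.exp(log_mel)  # undo the log compression from the extraction step
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=1024, hop_length=256
    )

# To save the result as wav data:
# import soundfile as sf; sf.write("target_sound.wav", synthesize_waveform(target_mel), 16000)
```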
According to the above technical solution, the user's voice can be processed from a sound fragment of any length input by the user, and clear speech can be synthesized even when the sound to be processed input by the user contains noise; that is, the effect of speech synthesis in multiple SNR (signal-to-noise ratio) environments is improved, and the target text is voiced, and can thus be read aloud, in the user's own timbre. The user does not need to record speech with restricted content, for a long time, or in a noise-free environment, so the complexity of speech synthesis for the user is greatly reduced while the speech synthesis effect is guaranteed.
Fig. 2 is a flowchart illustrating a speech synthesis method according to still another exemplary embodiment of the present disclosure. As shown in fig. 2, the method includes steps 101, 102 and 104 shown in fig. 1, and further includes step 201.
In step 201, the spectral data to be processed and the target text are input into a preset neural network model to obtain target spectral data corresponding to the timbre of the sound to be processed and the target text, wherein the training data of the preset neural network model includes a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus. The clear-pronunciation corpus and the noisy corpus in the training data can each be, for example, 100 hours; the noisy corpus can be obtained by clipping from arbitrary video or audio, or by adding various different noises to the clear-pronunciation corpus.
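One way to build such a noisy corpus is to mix noise into the clear-pronunciation corpus at chosen signal-to-noise ratios; the sketch below shows this mixing, with the SNR value and the length-matching strategy as illustrative assumptions.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Return clean speech with noise added at the requested SNR (in dB)."""
    noise = np.resize(noise, clean.shape)  # tile or trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. noisy_clip = mix_at_snr(clean_clip, babble_noise, snr_db=5.0)
```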
In a possible implementation manner, the training process of the preset neural network model includes adversarial training over the clear-pronunciation corpus and the noisy corpus. In the adversarial training, the noisy corpus can be used as adversarial samples, so that the preset neural network model can output clearer target spectrum data even when it receives noisy to-be-processed spectrum data.
Through this technical solution, the training data used in training the neural network model that processes the user's voice includes both a clear-pronunciation corpus and a noisy corpus, and adversarial training is performed over the two. The trained neural network model can therefore produce clear speech even when the noise in the sound to be processed input by the user is large; that is, the effect of speech synthesis in multiple SNR (signal-to-noise ratio) environments is improved, and a certain processing effect can be guaranteed even if the user records in a noisy environment.
In one possible implementation manner, the preset neural network model includes a preset speaker voice feature coding module, a preset text feature coding module, and a preset decoding module. The preset speaker voice feature coding module may be a speaker encoder, and the preset text feature coding module may be a text encoder. The preset speaker voice feature coding module is used to extract the speaker voice representation (speaker vector) from the to-be-processed spectrum data, the preset text feature coding module is used to extract the text information from the target text, and the preset decoding module is used to output the target spectrum data according to the speaker voice representation and the text information.
After the to-be-processed spectrum data is extracted from the sound to be processed input by a user, the to-be-processed spectrum data and the target text are input into the preset neural network model. The preset speaker voice feature coding module (speaker encoder) extracts the speaker voice representation from the to-be-processed spectrum data, while the preset text feature coding module (text encoder) extracts the text information from the target text; finally, the speaker voice representation extracted by the speaker encoder and the text information extracted by the text encoder are spliced together and input into the preset decoding module to generate the target spectrum data.
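A minimal sketch of this data flow follows, assuming PyTorch; the module internals, tensor shapes, and the splicing-by-concatenation step shown are illustrative assumptions about one way to realize the described structure.

```python
import torch
import torch.nn as nn

class SpeechSynthesisNet(nn.Module):
    """Speaker encoder + text encoder + decoder, spliced by concatenation."""
    def __init__(self, speaker_encoder: nn.Module,
                 text_encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.speaker_encoder = speaker_encoder  # preset speaker voice feature coding module
        self.text_encoder = text_encoder        # preset text feature coding module
        self.decoder = decoder                  # preset decoding module

    def forward(self, mel_to_process: torch.Tensor, text_ids: torch.Tensor):
        speaker_repr = self.speaker_encoder(mel_to_process)  # [batch, d_spk]
        text_repr = self.text_encoder(text_ids)              # [batch, frames, d_txt]
        # Broadcast the speaker representation over the text frames, then splice.
        speaker_repr = speaker_repr.unsqueeze(1).expand(-1, text_repr.size(1), -1)
        spliced = torch.cat([text_repr, speaker_repr], dim=-1)
        return self.decoder(spliced)                         # target spectrum data
```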
In one possible implementation, the preset neural network model is trained in the following manner: adversarial training is performed on the output information of the preset speaker voice feature coding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being marked with a clear label and a noisy label, respectively. That is, adversarial training is added on the output of the speaker encoder: during the training of the preset neural network model, the speaker encoder serves as the generating module of the adversarial network, and a discriminating module is added after its output for adversarial training. The speaker encoder is thereby trained to extract a clean speaker voice representation even from the noisy corpus, making the preset neural network model better adapted to environments with varied noise. The clear label may be, for example, 1, and the noisy label may be, for example, 0.
In addition, in a possible implementation manner, the preset neural network model may also be trained in the following manner: speaker classification training is performed on the output information of the preset speaker voice feature coding module so as to distinguish the different speakers in the training data. Performing speaker classification training on the output of the speaker encoder during training makes the speaker voice representation extracted from the to-be-processed spectrum data less dependent on the speaker's speaking style, so that it focuses only on the speaker's timbre, thereby improving the preset neural network model's processing of voice timbre.
When the output of the preset speaker voice feature coding module is subject to both adversarial training and speaker classification training, the two trainings on the speaker encoder output are carried out simultaneously during the training process.
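The sketch below, assuming PyTorch, combines the two objectives on the speaker encoder output: a discriminator separating the clear label (1) from the noisy label (0), and a speaker classifier; the loss weighting and the gradient-reversal mechanism for fooling the discriminator are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def joint_losses(speaker_repr: torch.Tensor, is_clear: torch.Tensor,
                 speaker_id: torch.Tensor, discriminator: nn.Module,
                 speaker_classifier: nn.Module):
    """Compute the two auxiliary losses on the speaker encoder output."""
    # Adversarial branch: the discriminator separates clear (1) from noisy (0).
    adv_logits = discriminator(speaker_repr).squeeze(-1)  # assumes [batch, 1] output
    adv_loss = F.binary_cross_entropy_with_logits(adv_logits, is_clear.float())
    # Speaker classification branch: distinguish the speakers in the training data.
    cls_logits = speaker_classifier(speaker_repr)
    cls_loss = F.cross_entropy(cls_logits, speaker_id)
    # Both losses are applied simultaneously; the encoder would typically be
    # trained to fool the discriminator, e.g. through a gradient-reversal layer.
    return adv_loss, cls_loss
```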
Fig. 3 is a flowchart illustrating a speech synthesis method according to still another exemplary embodiment of the present disclosure. As shown in fig. 3, the method includes steps 301 and 302 in addition to steps 101, 102 and 104 as shown in fig. 1.
In step 301, noise reduction is performed on the spectral data to be processed.
In step 302, target spectrum data corresponding to the timbre of the sound to be processed and the target text is generated according to the noise-reduced to-be-processed spectrum data and the target text.
As shown in fig. 3, after the to-be-processed spectrum data is extracted in step 102, noise reduction processing is performed on the to-be-processed spectrum data according to step 301, and then target spectrum data corresponding to the timbre of the to-be-processed sound and the target text is generated according to the to-be-processed spectrum data after noise reduction and the target text according to step 302. In this way, the effect of speech synthesis can be further improved from the viewpoint of reducing noise of the spectral data to be processed.
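This disclosure does not fix a particular noise-reduction algorithm for step 301; as one illustrative possibility, the sketch below applies simple spectral subtraction to the spectrum data, assuming (hypothetically) that the first frames of the clip contain no speech.

```python
import numpy as np

def spectral_subtraction(mag_spectrum: np.ndarray,
                         noise_frames: int = 10) -> np.ndarray:
    """Estimate the noise floor from the first frames (assumed non-speech)
    and subtract it from every frame of the spectrum."""
    noise_estimate = mag_spectrum[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = mag_spectrum - noise_estimate
    return np.clip(cleaned, a_min=0.0, a_max=None)
```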
Fig. 4 is a block diagram illustrating a structure of a voice synthesizing apparatus 100 according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the apparatus 100 includes: the acquiring module 10 is configured to acquire a to-be-processed sound and a target text input by a user, where the to-be-processed sound is a sound clip with an arbitrary length uttered by the user, and the sound clip contains noise; an extraction module 20, configured to extract to-be-processed spectrum data from the to-be-processed sound; a processing module 30, configured to generate target spectrum data corresponding to a timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text; and a synthesizing module 40, configured to synthesize a target sound corresponding to the target text according to the target spectrum data.
According to the above technical solution, the user's voice can be processed from a sound fragment of any length input by the user, and clear speech can be synthesized even when the sound to be processed input by the user contains noise; that is, the effect of speech synthesis in multiple SNR (signal-to-noise ratio) environments is improved, and the target text is voiced, and can thus be read aloud, in the user's own timbre. The user does not need to record speech with restricted content, for a long time, or in a noise-free environment, so the complexity of speech synthesis for the user is greatly reduced while the speech synthesis effect is guaranteed.
In one possible implementation, the processing module 30 is further configured to: input the spectrum data to be processed and the target text into a preset neural network model to obtain target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model includes a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus.
In a possible implementation manner, the training process of the preset neural network model includes adversarial training over the clear-pronunciation corpus and the noisy corpus.
In a possible implementation manner, the preset neural network model includes a preset speaker voice feature encoding module, a preset text feature encoding module and a preset decoding module, where the preset speaker voice feature encoding module is used to extract speaker voice representation in the to-be-processed spectrum data, the preset text feature encoding module is used to extract text information in the target text, and the preset decoding module is used to output the target spectrum data according to the speaker voice representation and the text information.
In one possible implementation, the preset neural network model is trained in the following manner: adversarial training is performed on the output information of the preset speaker voice feature coding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being marked with a clear label and a noisy label, respectively.
In one possible implementation, the preset neural network model is further trained by: and carrying out speaker classification training on the output information of the preset speaker voice feature coding module so as to distinguish different speakers in the training data.
In a possible implementation manner, before the processing module 30 generates target spectrum data corresponding to a timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text, the apparatus further includes: the noise reduction module is used for reducing noise of the spectrum data to be processed; the processing module is further used for generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the noise-reduced spectrum data to be processed and the target text.
Referring now to fig. 5, a schematic diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 5, the electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the electronic apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, communication may use any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user; extract to-be-processed spectrum data from the sound to be processed; input the spectrum data to be processed and the target text into a preset neural network model to obtain target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model comprises a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus, and training of the preset neural network model comprises adversarial training over the clear-pronunciation corpus and the noisy corpus; and synthesize, according to the target spectrum data, a target sound corresponding to the target text.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the acquisition module may also be described as "a module that acquires a sound to be processed and a target text input by a user".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a speech synthesis method, comprising:
acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise;
extracting to-be-processed spectrum data from the to-be-processed sound;
generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
and synthesizing, according to the target spectrum data, a target sound corresponding to the target text.
According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, the generating target spectral data corresponding to a timbre of the sound to be processed and the target text from the spectral data to be processed and the target text includes:
and inputting the spectrum data to be processed and the target text into a preset neural network model to obtain target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model comprises a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus.
Example 3 provides the method of example 2, according to one or more embodiments of the present disclosure, wherein the training process of the preset neural network model includes adversarial training over the clear-pronunciation corpus and the noisy corpus.
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of example 3, the preset neural network model including a preset speaker speech feature encoding module, a preset text feature encoding module, and a preset decoding module, wherein,
the preset speaker voice characteristic coding module is used for extracting speaker voice characterization in the to-be-processed spectrum data, the preset text characteristic coding module is used for extracting text information in the target text, and the preset decoding module is used for outputting the target spectrum data according to the speaker voice characterization and the text information.
In accordance with one or more embodiments of the present disclosure, example 5 provides the method of example 4, the preset neural network model is trained by:
and performing adversarial training on the output information of the preset speaker speech feature encoding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being marked with a clear label and a noisy label, respectively.
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 5, wherein the preset neural network model is further trained by:
and carrying out speaker classification training on the output information of the preset speaker voice feature coding module so as to distinguish different speakers in the training data.
According to one or more embodiments of the present disclosure, example 7 provides the method of any one of examples 1 to 6, further comprising, before the step of generating target spectral data corresponding to a timbre of the sound to be processed and the target text from the spectral data to be processed and the target text:
noise reduction is carried out on the spectrum data to be processed;
the generating target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text comprises:
and generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the noise-reduced spectrum data to be processed and the target text.
According to one or more embodiments of the present disclosure, example 8 provides a speech synthesis apparatus, the apparatus comprising:
an acquisition module, used for acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise;
the extraction module is used for extracting to-be-processed spectrum data from the to-be-processed sound;
the processing module is used for generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
and the synthesis module is used for synthesizing the target sound corresponding to the target text according to the target frequency spectrum data.
Example 9 provides the apparatus of example 8, according to one or more embodiments of the disclosure, the processing module further configured to: input the spectrum data to be processed and the target text into a preset neural network model to obtain target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model includes a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus.
In accordance with one or more embodiments of the present disclosure, example 10 provides the apparatus of example 9, wherein the training of the preset neural network model includes adversarial training over the clear-pronunciation corpus and the noisy corpus.
In accordance with one or more embodiments of the present disclosure, example 11 provides the apparatus of example 10, the preset neural network model including a preset speaker speech feature encoding module, a preset text feature encoding module, and a preset decoding module, wherein,
the preset speaker voice characteristic coding module is used for extracting speaker voice characterization in the to-be-processed spectrum data, the preset text characteristic coding module is used for extracting text information in the target text, and the preset decoding module is used for outputting the target spectrum data according to the speaker voice characterization and the text information.
Example 12 provides the apparatus of example 11, according to one or more embodiments of the present disclosure, the preset neural network model is trained by:
and performing adversarial training on the output information of the preset speaker speech feature encoding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being marked with a clear label and a noisy label, respectively.
Example 13 provides the apparatus of example 12, according to one or more embodiments of the disclosure, the preset neural network model further trained by: and carrying out speaker classification training on the output information of the preset speaker voice feature coding module so as to distinguish different speakers in the training data.
In accordance with one or more embodiments of the present disclosure, example 14 provides the apparatus of any one of examples 8-13, before the processing module generates target spectral data corresponding to a timbre of the sound to be processed and the target text from the spectral data to be processed and the target text, the apparatus further comprising: the noise reduction module is used for reducing noise of the spectrum data to be processed; the processing module is further used for generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the noise-reduced spectrum data to be processed and the target text.
According to one or more embodiments of the present disclosure, example 15 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any of examples 1-7.
Example 16 provides an electronic device according to one or more embodiments of the present disclosure, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing means for executing the one or more computer programs in the storage means to implement the steps of the method of any of examples 1-7.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, a technical solution formed by substituting the above features with technical features of similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments concerning the method, and will not be elaborated here.

Claims (7)

1. A method of speech synthesis, the method comprising:
acquiring a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of any length uttered by the user, and the sound fragment contains noise;
extracting to-be-processed spectrum data from the to-be-processed sound;
generating target spectrum data corresponding to the tone of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
synthesizing, according to the target spectrum data, a target sound corresponding to the target text;
the generating target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text comprises:
inputting the spectrum data to be processed and the target text into a preset neural network model to obtain target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model comprises a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus;
the training process of the preset neural network model comprises adversarial training over the clear-pronunciation corpus and the noisy corpus, the preset neural network model comprises a preset speaker voice feature coding module, a preset text feature coding module and a preset decoding module, and the preset neural network model is trained in the following manner:
performing adversarial training on the output information of the preset speaker voice feature coding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being marked with a clear label and a noisy label, respectively.
2. The method according to claim 1, wherein
the preset speaker voice characteristic coding module is used for extracting speaker voice characterization in the to-be-processed spectrum data, the preset text characteristic coding module is used for extracting text information in the target text, and the preset decoding module is used for outputting the target spectrum data according to the speaker voice characterization and the text information.
3. The method of claim 1, wherein the preset neural network model is further trained by:
performing speaker-classification training on the output information of the preset speaker voice feature coding module so as to distinguish different speakers in the training data.
4. The method according to any one of claims 1-3, wherein, before the step of generating target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text, the method further comprises:
performing noise reduction on the spectrum data to be processed;
and the generating target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text comprises:
generating the target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the noise-reduced spectrum data to be processed and the target text.
5. A speech synthesis apparatus, the apparatus comprising:
an acquisition module, configured to acquire a sound to be processed and a target text input by a user, wherein the sound to be processed is a sound fragment of arbitrary length uttered by the user, and the sound fragment contains noise;
an extraction module, configured to extract spectrum data to be processed from the sound to be processed;
a processing module, configured to generate target spectrum data corresponding to the timbre of the sound to be processed and the target text according to the spectrum data to be processed and the target text;
a synthesis module, configured to synthesize, according to the target spectrum data, a target sound corresponding to the target text;
wherein the processing module is further configured to: input the spectrum data to be processed and the target text into a preset neural network model to obtain the target spectrum data corresponding to the timbre of the sound to be processed and the target text, wherein training data of the preset neural network model comprises a clear-pronunciation corpus and a noisy corpus corresponding to the clear-pronunciation corpus;
and wherein the training process of the preset neural network model comprises adversarial training on the clear-pronunciation corpus and the noisy corpus, the preset neural network model comprises a preset speaker voice feature coding module, a preset text feature coding module and a preset decoding module, and the preset neural network model is trained by: performing adversarial training on output information of the preset speaker voice feature coding module so as to distinguish the clear-pronunciation corpus from the noisy corpus, the clear-pronunciation corpus and the noisy corpus being labeled with a clean label and a noisy label, respectively.
6. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1-4.
7. An electronic device, comprising:
a storage device having one or more computer programs stored thereon;
one or more processing devices, configured to execute the one or more computer programs in the storage device to implement the steps of the method according to any one of claims 1-4.
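
Note on the training scheme recited in claims 1 and 3: the claims do not bind any particular implementation, but the adversarial arrangement can be illustrated concretely. Below is a minimal sketch in PyTorch under assumptions not found in the patent itself (a GRU speaker encoder, a gradient-reversal layer as the adversarial mechanism, 80-band mel spectra, 256-dimensional representations, 100 training speakers, and all module and variable names); the preset text feature coding module, the preset decoding module and the vocoder are omitted.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    # Identity in the forward pass; negated, scaled gradient in the backward
    # pass, so the encoder is trained to confuse the discriminator.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class SpeakerEncoder(nn.Module):
    # Stand-in for the "preset speaker voice feature coding module":
    # maps a mel spectrogram of any length to one speaker representation.
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels):                    # mels: (batch, frames, n_mels)
        _, h = self.rnn(mels)
        return h[-1]                            # (batch, dim)

class NoiseDiscriminator(nn.Module):
    # Predicts the clean/noisy corpus label from the speaker representation;
    # the gradient-reversal layer makes this objective adversarial.
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, spk, lam=1.0):
        return self.net(GradReverse.apply(spk, lam))

encoder = SpeakerEncoder()
noise_disc = NoiseDiscriminator()
speaker_clf = nn.Linear(256, 100)               # claim 3: speaker-classification head
ce = nn.CrossEntropyLoss()
params = list(encoder.parameters()) + list(noise_disc.parameters()) + list(speaker_clf.parameters())
opt = torch.optim.Adam(params, lr=1e-4)

mels = torch.randn(8, 120, 80)                  # toy batch of mel spectrograms
noise_label = torch.randint(0, 2, (8,))         # 0 = clean corpus, 1 = noisy corpus
speaker_id = torch.randint(0, 100, (8,))        # toy speaker identities

spk = encoder(mels)                             # speaker representations
loss = ce(noise_disc(spk), noise_label) + ce(speaker_clf(spk), speaker_id)
opt.zero_grad()
loss.backward()
opt.step()

Gradient reversal is one common way of realizing such adversarial training, not necessarily the one used here: the discriminator is optimized to separate the clean and noisy labels while the reversed gradient drives the speaker voice feature coding module toward a noise-invariant representation, consistent with the stated aim of synthesizing clear pronunciation from noisy input of any length.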
CN202010197182.1A 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment Active CN111369968B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010197182.1A CN111369968B (en) 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111369968A CN111369968A (en) 2020-07-03
CN111369968B (en) 2023-10-13

Family

ID=71211889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010197182.1A Active CN111369968B (en) 2020-03-19 2020-03-19 Speech synthesis method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111369968B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112382297A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN113421547B (en) * 2021-06-03 2023-03-17 华为技术有限公司 Voice processing method and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10522135B2 (en) * 2017-05-24 2019-12-31 Verbit Software Ltd. System and method for segmenting audio files for transcription
US10971170B2 (en) * 2018-08-08 2021-04-06 Google Llc Synthesizing speech from text using neural networks

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107945786A (en) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 Phoneme synthesizing method and device
WO2019139430A1 (en) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
CN110232909A (en) * 2018-03-02 2019-09-13 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
WO2019222591A1 (en) * 2018-05-17 2019-11-21 Google Llc Synthesis of speech from text in a voice of a target speaker using neural networks
WO2019245916A1 (en) * 2018-06-19 2019-12-26 Georgetown University Method and system for parametric speech synthesis
CN109308903A (en) * 2018-08-02 2019-02-05 平安科技(深圳)有限公司 Speech imitation method, terminal device and computer readable storage medium
CN109754778A (en) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 Phoneme synthesizing method, device and the computer equipment of text
CN110070852A (en) * 2019-04-26 2019-07-30 平安科技(深圳)有限公司 Synthesize method, apparatus, equipment and the storage medium of Chinese speech
CN110232907A (en) * 2019-07-24 2019-09-13 出门问问(苏州)信息科技有限公司 A kind of phoneme synthesizing method, device, readable storage medium storing program for executing and calculate equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speech generation technology based on deep convolutional generative adversarial networks; 朱纯; 王翰林; 魏天远; 王伟; 仪表技术, No. 2; full text *
Deep cross-modal environmental sound synthesis; 程皓楠; 李思佳; 刘世光; 计算机辅助设计与图形学学报 (Journal of Computer-Aided Design & Computer Graphics), Vol. 31, No. 12; full text *

Similar Documents

Publication Publication Date Title
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
CN111583900A (en) Song synthesis method and device, readable medium and electronic equipment
CN112786006A (en) Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN112786008B (en) Speech synthesis method and device, readable medium and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN111292719A (en) Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN112489621A (en) Speech synthesis method, device, readable medium and electronic equipment
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111369968B (en) Speech synthesis method and device, readable medium and electronic equipment
WO2022237665A1 (en) Speech synthesis method and apparatus, electronic device, and storage medium
CN111429881B (en) Speech synthesis method and device, readable medium and electronic equipment
CN114429658A (en) Face key point information acquisition method, and method and device for generating face animation
CN116072108A (en) Model generation method, voice recognition method, device, medium and equipment
CN114495901A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111916095B (en) Voice enhancement method and device, storage medium and electronic equipment
CN112652292A (en) Method, apparatus, device and medium for generating audio
CN114765025A (en) Method for generating and recognizing speech recognition model, device, medium and equipment
CN112382297A (en) Method, apparatus, device and medium for generating audio
CN112382268A (en) Method, apparatus, device and medium for generating audio
CN112382273A (en) Method, apparatus, device and medium for generating audio
CN117316160B (en) Silent speech recognition method, silent speech recognition apparatus, electronic device, and computer-readable medium
CN117475993A (en) Accent conversion method, accent conversion device, accent conversion medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant