CN113838450A - Audio synthesis and corresponding model training method, device, equipment and storage medium - Google Patents

Audio synthesis and corresponding model training method, device, equipment and storage medium Download PDF

Info

Publication number
CN113838450A
CN113838450A (application CN202110918198.1A)
Authority
CN
China
Prior art keywords
training
audio
acoustic
characteristic information
adopting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110918198.1A
Other languages
Chinese (zh)
Other versions
CN113838450B (en)
Inventor
高占杰
李文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110918198.1A priority Critical patent/CN113838450B/en
Publication of CN113838450A publication Critical patent/CN113838450A/en
Application granted granted Critical
Publication of CN113838450B publication Critical patent/CN113838450B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The present disclosure provides an audio synthesis and corresponding model training method, apparatus, device and storage medium, and relates to the technical field of artificial intelligence such as deep learning, speech technology and natural language processing. The specific implementation scheme is as follows: segmenting the designated audio to obtain a plurality of audio slices; extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information; and synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text. The present disclosure also provides a training scheme for acoustic models.

Description

Audio synthesis and corresponding model training method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of artificial intelligence technologies such as deep learning, speech technology, and natural language processing, and in particular, to a method, an apparatus, a device, and a storage medium for audio synthesis and corresponding model training.
Background
Speech synthesis is increasingly used. Current speech synthesis mainly consists of two parts, an acoustic model for converting text or phonemes into audio and a vocoder for converting the audio into speech.
In the prior art, training an acoustic model generally requires a large amount of data. For example, to learn feature information such as the timbre and/or style of a certain speaker, it is difficult for the acoustic model to learn that feature information accurately if fewer than 20 utterances of the speaker are collected.
Disclosure of Invention
The present disclosure provides an audio synthesis and corresponding model training method, apparatus, device and storage medium.
According to an aspect of the present disclosure, there is provided an audio synthesizing method, wherein the method includes:
segmenting the designated audio to obtain a plurality of audio slices;
extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
and synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text.
According to another aspect of the present disclosure, there is provided a training method of an acoustic model, wherein the method includes:
segmenting at least one training audio frequency in the collected multiple training audio frequencies to obtain multiple training audio frequency slices;
arranging and combining the plurality of training audios, the collected corresponding training texts and the plurality of training audio slices to obtain a plurality of pieces of training data;
and training an acoustic model by using the plurality of pieces of training data.
According to still another aspect of the present disclosure, there is provided an audio synthesizing apparatus, wherein the apparatus includes:
the segmentation module is used for segmenting the specified audio to obtain a plurality of audio slices;
the extraction module is used for extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
and the synthesis module is used for synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text.
According to still another aspect of the present disclosure, there is provided an apparatus for training an acoustic model, wherein the apparatus includes:
the segmentation module is used for segmenting at least one training audio frequency in the collected multiple training audio frequencies to obtain multiple training audio frequency slices;
the combination module is used for arranging and combining the plurality of training audios, the collected corresponding training texts and the plurality of training audio slices to obtain a plurality of pieces of training data;
and the training module is used for training the acoustic model by adopting the plurality of pieces of training data.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of the aspects and any possible implementation described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the above aspect and any possible implementation.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the aspect and any possible implementation as described above.
According to the technology disclosed by the invention, the synthesis efficiency of the audio can be effectively improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing an audio synthesis method or a training method of an acoustic model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It is to be understood that the described embodiments are only a few, and not all, of the disclosed embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that the terminal device involved in the embodiments of the present disclosure may include, but is not limited to, a mobile phone, a Personal Digital Assistant (PDA), a wireless handheld device, a Tablet Computer (Tablet Computer), and other intelligent devices; the display device may include, but is not limited to, a personal computer, a television, and the like having a display function.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure; as shown in fig. 1, the present embodiment provides an audio synthesizing method, which may specifically include the following steps:
s101, segmenting designated audio to obtain a plurality of audio slices;
s102, extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
and S103, synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text.
The execution subject of the audio synthesis method of this embodiment may be an audio synthesis apparatus, which may be an electronic entity or an application integrated in software, and is used for synthesizing the audio of the speaker of the specified audio speaking the specified text.
The audio described in this embodiment, such as the designated audio and the synthesized audio, may take the form of a Mel-frequency (Mel) spectrogram, which may also be referred to as Mel audio.
Specifically, when the specified audio is segmented, equal-length slicing may be performed, for example taking every 100 frames as one slicing unit; if the last section left after slicing from front to back is shorter than 100 frames, it may still be used on its own without affecting the result. Alternatively, unequal-length slicing may be performed to obtain the plurality of audio slices. In this embodiment, whichever slicing manner is used, the audio slices are used to extract acoustic feature information; to ensure the accuracy of the extracted acoustic feature information, the length of each audio slice obtained by segmentation may be kept greater than a preset threshold, which can be set according to experience, for example 90 frames, 100 frames, or another integer number of frames.
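By way of illustration only, the following sketch shows one way such equal-length slicing could be implemented. It assumes Python with NumPy, a mel spectrogram stored as a frames-by-mel-bins array, and the 100-frame unit and 90-frame threshold mentioned above as example values; none of these specifics are mandated by the embodiment.

```python
import numpy as np

def slice_mel(mel, slice_len=100, min_len=90):
    """Split a mel spectrogram of shape (num_frames, num_mel_bins) into slices.

    Equal-length slicing: every `slice_len` frames form one slice. A final
    remainder shorter than `min_len` (the "preset threshold" of this
    embodiment) is dropped so that every returned slice stays above the
    threshold.
    """
    slices = []
    for start in range(0, mel.shape[0], slice_len):
        piece = mel[start:start + slice_len]
        if piece.shape[0] >= min_len:
            slices.append(piece)
    return slices

# Example: a 530-frame, 80-bin mel spectrogram yields five 100-frame slices;
# the 30-frame remainder falls below the 90-frame threshold and is dropped.
mel = np.random.randn(530, 80)
print([s.shape[0] for s in slice_mel(mel)])  # [100, 100, 100, 100, 100]
```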
The acoustic feature extraction model of this embodiment is used to extract acoustic feature information from an audio slice, such as timbre information, style information, or other acoustic feature information. Since style information requires longer audio to be extracted accurately, it needs a longer audio slice than timbre information does; a speaker's timbre information, by contrast, can be extracted from even a short audio slice. That is, when the audio slices are short, the scheme of this embodiment is particularly effective in the case where the acoustic feature information is timbre.
When the method is used, each audio slice is input into the acoustic feature extraction model, and the acoustic feature extraction model can predict and output the acoustic feature information of each audio slice. For multiple audio slices, multiple acoustic feature information may be obtained.
Finally, using the encoder and the decoder, the corresponding audio can be synthesized based on the plurality of acoustic feature information and the specified text; that is, the synthesized audio is the audio of the speaker of the specified audio speaking the specified text.
The pre-trained acoustic feature extraction model, encoder, and decoder of this embodiment may be collectively referred to as an acoustic model, which is used to produce the audio of the speaker of the specified audio speaking the specified text. During training, the acoustic model formed by the acoustic feature extraction model, the encoder, and the decoder may be trained as a whole.
In the audio synthesis method of the embodiment, a plurality of audio slices are obtained by segmenting the designated audio; extracting acoustic characteristic information corresponding to each audio slice by adopting a pre-trained acoustic characteristic extraction model to obtain a plurality of acoustic characteristic information; and then on the basis of a plurality of acoustic characteristic information and appointed texts, a pre-trained encoder and a pre-trained decoder are adopted to synthesize corresponding audio, so that the accuracy of the extracted acoustic characteristic information of the appointed audio can be effectively improved, the accuracy of the synthesized audio is effectively improved, and the audio synthesis efficiency is improved.
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure; the audio synthesis method of the present embodiment further introduces the technical solution of the present application in more detail on the basis of the technical solution of the embodiment shown in fig. 1. As shown in fig. 2, the audio synthesis method of this embodiment may specifically include the following steps:
s201, segmenting designated audio to obtain a plurality of audio slices;
s202, extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
s203, acquiring target acoustic characteristic information based on the plurality of acoustic characteristic information;
specifically, the target acoustic feature information may be obtained by calculation based on a plurality of acoustic feature information in a mathematical calculation manner. For example, the following formula may be adopted, and an average value M of a plurality of acoustic feature information is taken as the target acoustic feature information: m ═ x1+x2+x3+……+xn) N, wherein x1、x2、x3、……、xnRespectively representing n pieces of acoustic characteristic information; m represents an average value of n pieces of acoustic feature information.
For another example, the variance of the n pieces of acoustic feature information may first be calculated: S² = [(x₁ − M)² + (x₂ − M)² + (x₃ − M)² + … + (xₙ − M)²] / n. Then, the two pieces of acoustic feature information whose difference from the variance is smallest are obtained from the n pieces of acoustic feature information, and the average value of these two pieces is taken as the target acoustic feature information. Alternatively, the target acoustic feature information may be obtained in other manners; for example, the n pieces of acoustic feature information may be divided into several groups, the average value of each group calculated, and the average of all group averages taken as the target acoustic feature information. In practical applications, other mathematical calculations may also be used to obtain the target acoustic feature information, which are not described in detail here.
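As an informal illustration of the mathematical calculation described above, the sketch below assumes Python with NumPy and that each piece of acoustic feature information is a fixed-length vector. It shows the simple averaging variant and one possible reading of the variance-based variant, in which the two slice-level features that deviate least from the mean are averaged; the exact selection rule is an assumption.

```python
import numpy as np

def target_feature_mean(features):
    """Simple variant: the element-wise mean M of the n slice-level features."""
    return np.asarray(features).mean(axis=0)

def target_feature_closest_pair(features):
    """Variance-based variant (one possible reading of the embodiment):
    average the two slice-level features that deviate least from the mean M,
    i.e. the two that contribute least to the variance S^2."""
    feats = np.asarray(features)                 # shape (n, dim)
    mean = feats.mean(axis=0)                    # M
    contrib = ((feats - mean) ** 2).sum(axis=1)  # per-slice contribution to S^2
    best_two = np.argsort(contrib)[:2]
    return feats[best_two].mean(axis=0)

# Example with n = 6 slice-level features of dimension 256.
features = np.random.randn(6, 256)
print(target_feature_mean(features).shape)           # (256,)
print(target_feature_closest_pair(features).shape)   # (256,)
```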
Taking timbre as an example, in the manner of this embodiment the specified audio is first segmented into a plurality of audio slices, and the acoustic feature extraction model then extracts the timbre feature information corresponding to each audio slice to obtain a plurality of timbre feature information. Target timbre feature information is then obtained based on the plurality of timbre feature information; compared with extracting timbre feature information directly from the whole specified audio, this can further improve the accuracy of the extracted target timbre feature information.
S204, encoding the specified text by using an encoder to obtain encoding characteristic information of the specified text;
and S205, decoding by adopting a decoder to obtain audio based on the target acoustic characteristic information and the coding characteristic information of the specified text.
The encoder and decoder of the present embodiment are also neural network models, and may be implemented, for example, using a recurrent neural network model.
In operation, the encoder receives the specified text as input and encodes it to obtain the corresponding encoding feature information. The decoder receives the target acoustic feature information and the encoding feature information of the specified text as input, and decodes them to obtain the audio of the speaker of the specified audio speaking the specified text. Corresponding speech may subsequently be synthesized based on this audio.
The timbre feature information and the text encoding feature information of this embodiment are both expressed in the form of vectors.
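The embodiment only states that the encoder and the decoder are neural network models, for example recurrent ones. The toy sketch below is therefore just a rough illustration of how such an encoder and decoder could be wired together; PyTorch, the GRU layers, the dimensions, and the conditioning by concatenating the target acoustic feature with the text encoding are all assumptions (duration modeling and attention are omitted), not the architecture of this disclosure.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a phoneme/character id sequence into text encoding features."""
    def __init__(self, vocab_size=100, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, text_ids):                       # (batch, T_text)
        out, _ = self.rnn(self.embed(text_ids))
        return out                                     # (batch, T_text, hidden)

class MelDecoder(nn.Module):
    """Decodes text encoding features, conditioned on the target acoustic
    feature (e.g. a timbre embedding), into a mel spectrogram."""
    def __init__(self, hidden=256, feat_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(hidden + feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, text_feats, acoustic_feat):      # (B, T, H), (B, feat_dim)
        cond = acoustic_feat.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        out, _ = self.rnn(torch.cat([text_feats, cond], dim=-1))
        return self.proj(out)                          # (B, T, n_mels)

# Usage: synthesize a mel spectrogram for a specified text and target timbre.
encoder, decoder = TextEncoder(), MelDecoder()
text_ids = torch.randint(0, 100, (1, 32))              # specified text as ids
target_feat = torch.randn(1, 256)                      # from the feature extractor
mel = decoder(encoder(text_ids), target_feat)
print(mel.shape)                                       # torch.Size([1, 32, 80])
```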
By adopting the technical scheme, the audio synthesis method of the embodiment can effectively improve the accuracy of the extracted acoustic characteristic information of the specified audio, thereby effectively improving the accuracy of the synthesized audio and improving the synthesis efficiency of the audio.
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure; as shown in fig. 3, the present embodiment provides a training method of an acoustic model, which specifically includes the following steps:
s301, segmenting at least one training audio frequency in the collected multiple training audio frequencies to obtain multiple training audio frequency slices;
s302, arranging and combining a plurality of training audios, the collected corresponding training texts and a plurality of training audio slices to obtain a plurality of pieces of training data;
and S303, training the acoustic model by adopting a plurality of pieces of training data.
The executing subject of the training method of the acoustic model of this embodiment may be a training apparatus of the acoustic model, and the apparatus may be an electronic entity or may also be an application adopting software integration, and is used for training the acoustic model. The acoustic model of this embodiment is the acoustic model including the acoustic feature extraction model, the encoder, and the decoder of the embodiment shown in fig. 1 or fig. 2.
In this embodiment, the training data obtained after the permutation and combination may include a training text, a training audio corresponding to the training text, and a training audio slice, where the training audio slice is used to provide acoustic features such as tone and/or style during the training process.
In this embodiment, the collected training audios and training texts are in one-to-one correspondence, but in practical application scenarios the amount of training audio that can be collected is still very limited. The acoustic model of this embodiment is used to learn acoustic characteristics of a speaker such as timbre and/or style, especially timbre, and the requirement on the length of the audio data is very low; for example, an audio slice of only 100 frames is enough to learn the timbre. Therefore, in order to increase the amount of training data for the acoustic model, at least one training audio may be segmented in this embodiment to obtain a plurality of training audio slices. Further, to increase the number of training audio slices, it is preferable to segment every training audio. The training audio slices are then arranged and combined with the training audios and the corresponding training texts to enrich the training data. For example, if 10 training audios and corresponding training texts are collected and each training audio is segmented into 8 training audio slices, a total of 80 training audio slices is obtained; by permutation and combination, 800 pieces of training data can then be obtained, which greatly enriches the amount of training data. In the training data obtained in this way, the training audio slice is decoupled from the training audio; that is, within the same piece of training data, the training audio slice is not necessarily a segment of the training audio, and the text information corresponding to the training audio slice and that of the training audio need not intersect.
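A minimal sketch of this permutation-and-combination step, assuming Python; the dictionary field names and placeholder identifiers are purely illustrative.

```python
from itertools import product

def build_training_data(audio_text_pairs, audio_slices):
    """Cross-combine every (training audio, training text) pair with every
    training audio slice, so the slice in a piece of training data is
    decoupled from the audio/text it originally came from."""
    return [
        {"text": text, "audio": audio, "slice": sl}
        for (audio, text), sl in product(audio_text_pairs, audio_slices)
    ]

# Example matching the numbers above: 10 (audio, text) pairs x 80 slices.
pairs = [(f"audio_{i}", f"text_{i}") for i in range(10)]
slices = [f"slice_{j}" for j in range(80)]
print(len(build_training_data(pairs, slices)))  # 800 pieces of training data
```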
Finally, the acoustic model is trained using the plurality of pieces of training data. Because the training data volume is expanded by multiple times, the acoustic model can fully learn the acoustic characteristic information in the training data, and the training effect of the acoustic model is improved.
By adopting the technical scheme, the training method of the acoustic model can greatly enrich the data volume of the training data, further enable the acoustic model to fully learn the acoustic characteristic information in the training data, and effectively improve the efficiency of synthesizing audio by the trained acoustic model.
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure; the training method of the acoustic model of this embodiment further describes the technical solution of this application in more detail on the basis of the technical solution of the embodiment shown in fig. 3, and as shown in fig. 4, the training method of the acoustic model of this embodiment may specifically include the following steps:
s401, collecting a plurality of training audios and training texts corresponding to the training audios;
S402, segmenting at least one of the collected plurality of training audios to obtain a plurality of training audio slices;
s403, arranging and combining a plurality of training audios, the collected corresponding training texts and a plurality of training audio slices to obtain a plurality of pieces of training data;
referring to the above description of the embodiment shown in fig. 3, the training audio and the training text of the embodiment are in a one-to-one correspondence relationship, that is, the training audio is the audio of the training text to be trained by the speaker. In the training data obtained by permutation and combination, the training audio slice and the training text are completely deconstructed and have no relation.
S404, for each piece of training data, extracting acoustic feature information of the training audio slice in the training data by adopting the acoustic feature extraction model in the acoustic model to obtain training acoustic feature information;
s405, coding a training text in the training data by adopting a coder in the acoustic model to obtain coding characteristic information of the training text;
s406, decoding based on the training acoustic characteristic information and the training text coding characteristic information by adopting a decoder in the acoustic model to obtain a prediction audio;
the above steps in the training process are similar to the principle of the use process of the embodiment shown in fig. 2, and are not described again here.
S407, constructing a loss function based on the prediction audio and the training audio in the training data;
s408, judging whether the loss function is converged, and if not, executing the step S409; if yes, go to step S410;
s409, adjusting parameters of the acoustic feature extraction model, the encoder and the decoder to enable the loss function to tend to converge; returning to the step S404, selecting the next piece of training data and continuing training;
in this embodiment, when the parameters of the acoustic feature extraction model, the encoder, and the decoder are adjusted, the adjustment may be performed based on a gradient descent method. In particular, the three models may be adjusted separately, e.g., one model for every N rounds of training. The three models can also be adjusted simultaneously. In any case, the adjustment method may be any adjustment method as long as the trend toward the decrease of the loss function is adjusted.
S410, detecting whether a training termination condition is met; if so, finishing the training, determining the parameters of the acoustic model comprising the acoustic feature extraction model, the encoder, and the decoder, and thereby determining the acoustic model; otherwise, returning to step S404 to select the next piece of training data and continue training;
the training termination condition of this embodiment may be that the training frequency reaches a preset frequency threshold, or may be that the loss function is always converged during training of consecutive rounds. When the training termination condition is met, the acoustic model can be considered to be trained to be mature and can be used, and the training can be terminated at the moment.
By adopting the technical scheme, the training method of the acoustic model can greatly enrich the data volume of the training data, further enable the acoustic model to fully learn the acoustic characteristic information in the training data, and effectively improve the efficiency of synthesizing audio by the trained acoustic model.
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure; the present embodiment provides an audio synthesizing apparatus 500, including:
a segmentation module 501, configured to segment a specified audio to obtain multiple audio slices;
an extraction module 502, configured to extract acoustic feature information corresponding to each audio slice by using a pre-trained acoustic feature extraction model to obtain multiple pieces of acoustic feature information;
and a synthesizing module 503, configured to synthesize corresponding audio by using a pre-trained encoder and decoder based on the plurality of acoustic feature information and the specified text.
The audio synthesis apparatus 500 of this embodiment implements the implementation principle and technical effect of audio synthesis by using the above modules, which are the same as the implementation of the related method embodiment described above, and reference may be made to the description of the related method embodiment in detail, which is not repeated herein.
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure; the audio synthesizing apparatus 500 of the present embodiment further describes the technical solution of the present application in more detail based on the technical solution of the embodiment shown in fig. 5.
As shown in fig. 6, in the audio synthesizing apparatus 500 of the present embodiment, the synthesizing module 503 includes:
an obtaining unit 5031 configured to obtain target acoustic feature information based on the plurality of acoustic feature information;
an encoding unit 5032, configured to encode the specified text by using an encoder to obtain encoding characteristic information of the specified text;
a decoding unit 5033, configured to decode, by using a decoder, the audio based on the target acoustic feature information and the encoding feature information of the specified text.
Further optionally, wherein the obtaining unit 5031 is configured to:
and calculating to obtain target acoustic characteristic information based on the plurality of acoustic characteristic information by adopting a mathematical calculation mode.
The audio synthesis apparatus 500 of this embodiment implements the implementation principle and technical effect of audio synthesis by using the above modules, which are the same as the implementation of the related method embodiment described above, and reference may be made to the description of the related method embodiment in detail, which is not repeated herein.
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure; the embodiment provides a training apparatus 700 for an acoustic model, including:
a segmentation module 701, configured to segment at least one of the acquired multiple pieces of training audio to obtain multiple training audio slices;
a combination module 702, configured to arrange and combine multiple training audios and the acquired corresponding training texts with multiple training audio slices to obtain multiple pieces of training data;
a training module 703, configured to train the acoustic model with a plurality of pieces of training data.
The implementation principle and technical effect of the training of the acoustic model implemented by the modules in the training apparatus 700 of the acoustic model of this embodiment are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not described herein again.
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure; the acoustic model training apparatus 700 of the present embodiment further describes the technical solution of the present application in more detail based on the technical solution of the embodiment shown in fig. 7.
As shown in fig. 8, in the training apparatus 700 for acoustic models according to the present embodiment, the training module 703 includes:
an extracting unit 7031, configured to extract, for each piece of training data, acoustic feature information of a training feature audio slice in the training data by using an acoustic feature extraction model in the acoustic model, so as to obtain training acoustic feature information;
the encoding unit 7032 is configured to encode a training text in the training data by using an encoder in the acoustic model to obtain training text encoding feature information;
a decoding unit 7033, configured to decode based on the training acoustic feature information and the training text coding feature information by using a decoder in the acoustic model to obtain a predicted audio;
a constructing unit 7034 configured to construct a loss function based on the prediction audio and the training audio in the training data;
an adjusting unit 7035, configured to perform parameter adjustment on the acoustic feature extraction model, the encoder, and the decoder if the loss function is not converged, so that the loss function tends to be converged.
Further optionally, in the training apparatus 700 for an acoustic model according to this embodiment, the method further includes:
the collecting module 704 is configured to collect a plurality of training audios and training texts corresponding to the training audios.
The implementation principle and technical effect of the training of the acoustic model implemented by the modules in the training apparatus 700 of the acoustic model of this embodiment are the same as those of the related method embodiments, and reference may be made to the description of the related method embodiments in detail, which is not described herein again.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 executes the respective methods and processes described above, such as the audio synthesis method or the training method of the acoustic model. For example, in some embodiments, the audio synthesis method or the training method of the acoustic model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above described audio synthesis method or training method of the acoustic model may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g. by means of firmware) to perform the audio synthesis method or the training method of the acoustic model.
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. A method of audio synthesis, wherein the method comprises:
segmenting the designated audio to obtain a plurality of audio slices;
extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
and synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text.
2. The method of claim 1, wherein synthesizing the corresponding audio using a pre-trained encoder and decoder based on the plurality of acoustic feature information and the specified text comprises:
acquiring target acoustic characteristic information based on the plurality of acoustic characteristic information;
the specified text is coded by the coder to obtain coding characteristic information of the specified text;
and decoding by adopting the decoder to obtain the audio frequency based on the target acoustic characteristic information and the coding characteristic information of the specified text.
3. The method of claim 2, wherein obtaining target acoustic feature information based on the plurality of acoustic feature information comprises:
and calculating to obtain the target acoustic characteristic information based on the plurality of acoustic characteristic information by adopting a mathematical calculation mode.
4. A method of training an acoustic model, wherein the method comprises:
segmenting at least one training audio frequency in the collected multiple training audio frequencies to obtain multiple training audio frequency slices;
arranging and combining the plurality of training audios, the collected corresponding training texts and the plurality of training audio slices to obtain a plurality of pieces of training data;
and training an acoustic model by using the plurality of pieces of training data.
5. The method of claim 4, wherein training an acoustic model using the plurality of pieces of training data comprises:
for each piece of training data, extracting acoustic feature information of the training feature audio slice in the training data by adopting an acoustic feature extraction model in the acoustic model to obtain training acoustic feature information;
coding a training text in the training data by adopting a coder in the acoustic model to obtain coding characteristic information of the training text;
decoding by using a decoder in the acoustic model based on the training acoustic characteristic information and the training text coding characteristic information to obtain a prediction audio;
constructing a loss function based on the predicted audio and the training audio in the training data;
and if the loss function is not converged, performing parameter adjustment on the acoustic feature extraction model, the encoder and the decoder to enable the loss function to tend to be converged.
6. The method of claim 4 or 5, wherein before segmenting at least one of the plurality of acquired training audios into a plurality of training audio slices, the method further comprises:
and acquiring the plurality of training audios and the training texts corresponding to the training audios.
7. An audio synthesis apparatus, wherein the apparatus comprises:
the segmentation module is used for segmenting the specified audio to obtain a plurality of audio slices;
the extraction module is used for extracting acoustic feature information corresponding to each audio slice by adopting a pre-trained acoustic feature extraction model to obtain a plurality of acoustic feature information;
and the synthesis module is used for synthesizing corresponding audio by adopting a pre-trained coder and a pre-trained decoder based on the plurality of acoustic characteristic information and the specified text.
8. The apparatus of claim 7, wherein the synthesis module comprises:
an acquisition unit configured to acquire target acoustic feature information based on the plurality of acoustic feature information;
the encoding unit is used for encoding the specified text by adopting the encoder to obtain the encoding characteristic information of the specified text;
and the decoding unit is used for decoding the audio by adopting the decoder based on the target acoustic characteristic information and the coding characteristic information of the specified text.
9. The apparatus of claim 8, wherein the obtaining unit is configured to:
and calculating to obtain the target acoustic characteristic information based on the plurality of acoustic characteristic information by adopting a mathematical calculation mode.
10. An apparatus for training an acoustic model, wherein the apparatus comprises:
the segmentation module is used for segmenting at least one training audio frequency in the collected multiple training audio frequencies to obtain multiple training audio frequency slices;
the combination module is used for arranging and combining the training audios, the collected corresponding training texts and the training audio slices to obtain a plurality of pieces of training data;
and the training module is used for training the acoustic model by adopting the plurality of pieces of training data.
11. The apparatus of claim 10, wherein the training module comprises:
the extraction unit is used for extracting the acoustic feature information of the training feature audio slice in the training data by adopting an acoustic feature extraction model in the acoustic model for each piece of training data to obtain training acoustic feature information;
the coding unit is used for coding the training text in the training data by adopting a coder in the acoustic model to obtain the coding characteristic information of the training text;
the decoding unit is used for decoding based on the training acoustic characteristic information and the training text coding characteristic information by adopting a decoder in the acoustic model to obtain a prediction audio;
a construction unit configured to construct a loss function based on the prediction audio and the training audio in the training data;
and the adjusting unit is used for adjusting parameters of the acoustic feature extraction model, the encoder and the decoder if the loss function is not converged, so that the loss function tends to be converged.
12. The apparatus of claim 10 or 11, wherein the apparatus further comprises:
and the acquisition module is used for acquiring the plurality of training audios and the training texts corresponding to the training audios.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3 or 4-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-3 or 4-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-3 or 4-6.
CN202110918198.1A 2021-08-11 2021-08-11 Audio synthesis and corresponding model training method, device, equipment and storage medium Active CN113838450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110918198.1A CN113838450B (en) 2021-08-11 2021-08-11 Audio synthesis and corresponding model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110918198.1A CN113838450B (en) 2021-08-11 2021-08-11 Audio synthesis and corresponding model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113838450A true CN113838450A (en) 2021-12-24
CN113838450B CN113838450B (en) 2022-11-25

Family

ID=78963270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110918198.1A Active CN113838450B (en) 2021-08-11 2021-08-11 Audio synthesis and corresponding model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113838450B (en)

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006030609A (en) * 2004-07-16 2006-02-02 Yamaha Corp Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program
US20090094031A1 (en) * 2007-10-04 2009-04-09 Nokia Corporation Method, Apparatus and Computer Program Product for Providing Text Independent Voice Conversion
CN107564511A (en) * 2017-09-25 2018-01-09 平安科技(深圳)有限公司 Electronic installation, phoneme synthesizing method and computer-readable recording medium
US20180322894A1 (en) * 2017-05-05 2018-11-08 Canary Speech, LLC Selecting speech features for building models for detecting medical conditions
CN109036384A (en) * 2018-09-06 2018-12-18 百度在线网络技术(北京)有限公司 Audio recognition method and device
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110390928A (en) * 2019-08-07 2019-10-29 广州多益网络股份有限公司 It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically
WO2020018724A1 (en) * 2018-07-19 2020-01-23 Dolby International Ab Method and system for creating object-based audio content
CN111031386A (en) * 2019-12-17 2020-04-17 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111462756A (en) * 2019-01-18 2020-07-28 北京猎户星空科技有限公司 Voiceprint recognition method and device, electronic equipment and storage medium
CN112365882A (en) * 2020-11-30 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, model training method, device, equipment and storage medium
CN112365876A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Method, device and equipment for training speech synthesis model and storage medium
CN112365881A (en) * 2020-11-11 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, and training method, device, equipment and medium of corresponding model
CN112365879A (en) * 2020-11-04 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
KR20210038470A (en) * 2020-08-24 2021-04-07 베이징 바이두 넷컴 사이언스 앤 테크놀로지 코., 엘티디. Method and apparatus for training speech spectrum generation model, and electronic device
CN112634859A (en) * 2020-12-28 2021-04-09 苏州思必驰信息科技有限公司 Data enhancement method and system for text-related speaker recognition
CN112767910A (en) * 2020-05-13 2021-05-07 腾讯科技(深圳)有限公司 Audio information synthesis method and device, computer readable medium and electronic equipment
US20210151029A1 (en) * 2019-11-15 2021-05-20 Electronic Arts Inc. Generating Expressive Speech Audio From Text Data

Also Published As

Publication number Publication date
CN113838450B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN110782882B (en) Voice recognition method and device, electronic equipment and storage medium
CN112466288B (en) Voice recognition method and device, electronic equipment and storage medium
CN113129870B (en) Training method, device, equipment and storage medium of speech recognition model
CN114445831A (en) Image-text pre-training method, device, equipment and storage medium
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
CN114141228A (en) Training method of speech synthesis model, speech synthesis method and device
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113380239A (en) Training method of voice recognition model, voice recognition method, device and equipment
CN114023342B (en) Voice conversion method, device, storage medium and electronic equipment
CN114861059A (en) Resource recommendation method and device, electronic equipment and storage medium
CN113838450B (en) Audio synthesis and corresponding model training method, device, equipment and storage medium
CN114399998B (en) Voice processing method, device, equipment, storage medium and program product
CN113889089A (en) Method and device for acquiring voice recognition model, electronic equipment and storage medium
CN114898742A (en) Method, device, equipment and storage medium for training streaming voice recognition model
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
CN114187892A (en) Style migration synthesis method and device and electronic equipment
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
CN113689844B (en) Method, device, equipment and storage medium for determining speech synthesis model
CN114267376B (en) Phoneme detection method and device, training method and device, equipment and medium
CN115101041A (en) Method and device for training speech synthesis and speech synthesis model
CN113838453B (en) Voice processing method, device, equipment and computer storage medium
CN115064148A (en) Method and device for training speech synthesis and speech synthesis model
CN115662397B (en) Voice signal processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant