CN116364054A - Voice synthesis method, device, equipment and storage medium based on diffusion - Google Patents

Voice synthesis method, device, equipment and storage medium based on diffusion

Info

Publication number
CN116364054A
Authority
CN
China
Prior art keywords
intermediate data
inputting
target
data
distribution information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310394113.3A
Other languages
Chinese (zh)
Inventor
郭洋
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310394113.3A priority Critical patent/CN116364054A/en
Publication of CN116364054A publication Critical patent/CN116364054A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a speech synthesis method, device, equipment and storage medium based on diffusion, wherein the method comprises the following steps: inputting an obtained target sentence into a preset acoustic model for acoustic feature extraction to obtain a Mel frequency spectrum; obtaining a pre-trained diffusion vocoder comprising a full connection layer, a first convolution layer, a second convolution layer and a residual block; inputting a preset time step into the full connection layer to obtain first intermediate data; inputting target audio corresponding to the target sentence and the Mel frequency spectrum into the first convolution layer for convolution calculation to obtain second intermediate data; adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data; inputting the third intermediate data into the residual block to obtain fourth intermediate data; and inputting the fourth intermediate data into the second convolution layer for convolution calculation to obtain the target synthesized voice. The method reduces the parameter quantity used for model training by introducing the diffusion vocoder, thereby improving the efficiency of speech synthesis.

Description

Voice synthesis method, device, equipment and storage medium based on diffusion
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for synthesizing speech based on diffusion.
Background
With the continuous development of deep learning, voice information highly similar to human speech can be generated by deep-learning-based speech synthesis models. A deep-learning-based speech synthesis model mostly consists of an acoustic model, which converts text information into acoustic feature information, and a vocoder, which generates voice information from the acoustic feature information. Existing vocoders in deep-learning-based speech synthesis models are mainly vocoders based on generative adversarial networks (Generative Adversarial Networks, GAN) or vocoders based on flows (FLOW). The GAN-based vocoder makes training of the speech synthesis model difficult to converge, while the flow-based vocoder requires a large number of model parameters, so the speech synthesis model consumes a large amount of computing resources and is difficult to converge, resulting in low speech synthesis efficiency.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for speech synthesis based on diffusion, which can effectively improve the efficiency of speech synthesis.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech synthesis method based on a diffusion probability model, the method including:
acquiring a target sentence, inputting the target sentence into a preset acoustic model for acoustic feature extraction, and obtaining a Mel frequency spectrum;
obtaining a pre-trained diffusion vocoder, wherein the diffusion vocoder comprises a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence;
acquiring a preset time step, and inputting the time step into the full connection layer to obtain first intermediate data;
acquiring target audio corresponding to the target sentence, and inputting the target audio and the Mel frequency spectrum into a first convolution layer for convolution calculation to obtain second intermediate data;
adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data;
inputting the third intermediate data into a plurality of residual layers which are connected in sequence to obtain fourth intermediate data;
and inputting the fourth intermediate data into the second convolution layer to perform convolution calculation to obtain target synthesized voice.
In some embodiments, the target audio includes initial speech distribution information, and the diffusion vocoder is trained by:
acquiring a preset Markov chain;
inputting the initial voice distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain Gaussian noise distribution information;
inputting the Gaussian noise distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain target voice distribution information;
and training the diffusion vocoder according to the target voice distribution information, the Mel frequency spectrum and the time step.
In some embodiments, the initial speech distribution information and the Mel frequency spectrum are input into the Markov chain for data conversion to obtain the Gaussian noise distribution information, which is determined according to the following formula:

$$q(x_{1:T} \mid x_0, mel) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}, mel)$$

wherein x_0 is the initial speech distribution information, mel is the Mel frequency spectrum, x_t is the hidden variable of the current time step, x_T is the Gaussian noise distribution information, t is the time step, and q(x_t | x_{t-1}, mel) is determined according to the following formula:

$$q(x_t \mid x_{t-1}, mel) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

wherein β_t is a constant, I is the first standard normal distribution information corresponding to the initial speech distribution information x_0, √(1-β_t)·x_{t-1} is the mean of the Gaussian noise distribution information, and β_t·I is its variance.
In some embodiments, the Gaussian noise distribution information and the Mel frequency spectrum are input into the Markov chain for data conversion to obtain the target voice distribution information, which is determined according to the following formula:

$$p_\theta(x_{0:T} \mid mel) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, mel)$$

wherein x_0 is the target voice distribution information, mel is the Mel frequency spectrum, t is the time step, and p_θ(x_{t-1} | x_t, mel) is determined according to the following formula:

$$p_\theta(x_{t-1} \mid x_t, mel) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)^2 I\right)$$

wherein μ_θ and σ_θ are preset models to be trained and t is the time step.
In some embodiments, the training the diffusion vocoder according to the target speech distribution information, the mel spectrum, and the time step comprises:
acquiring a preset initial diffusion vocoder;
constructing a target loss function according to the target voice distribution information, the Mel frequency spectrum and the time step;
and training the initial diffusion vocoder according to the target loss function to obtain the diffusion vocoder.
In some embodiments, the target loss function is determined according to the following formula:

$$L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t,\ mel\right)\right\rVert^2\right]$$

wherein mel is the Mel frequency spectrum, x_0 is the target voice distribution information, ε is the second standard normal distribution information corresponding to x_0, ε_θ is the diffusion vocoder, t is the time step, and ᾱ_t and α_t are determined according to the following formulas:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \alpha_t = 1 - \beta_t$$

wherein β_t is a constant.
In some embodiments, each residual layer includes a third convolution layer, a fourth convolution layer, a tanh layer, and a Sigmoid layer, and the inputting the third intermediate data to a plurality of sequentially connected residual layers, to obtain fourth intermediate data includes:
inputting the third intermediate data into the third convolution layer to perform convolution calculation to obtain fifth intermediate data;
inputting the third intermediate data into the tanh layer for data activation processing to obtain first activation data;
inputting the third intermediate data to the Sigmoid layer for data activation processing to obtain second activation data;
multiplying the first activation data and the second activation data to obtain target activation data;
inputting the target activation data into the fourth convolution layer to perform convolution calculation to obtain sixth intermediate data;
And adding the fifth intermediate data and the sixth intermediate data to obtain the fourth intermediate data.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis apparatus based on diffusion, the apparatus comprising:
the acoustic feature conversion module is used for acquiring a target sentence, inputting the target sentence into a preset acoustic model for data conversion processing, and obtaining a Mel frequency spectrum;
the model acquisition module is used for acquiring a pre-trained diffusion vocoder, wherein the diffusion vocoder comprises a full-connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence;
the first data processing module is used for acquiring a preset time step, and inputting the time step into the full-connection layer to obtain first intermediate data;
the second data processing module is used for acquiring target audio corresponding to the target sentence, inputting the target audio and the Mel frequency spectrum into the first convolution layer for convolution calculation, and obtaining second intermediate data;
the third data processing module is used for adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data;
The fourth data processing module is used for inputting the third intermediate data into a plurality of residual layers which are connected in sequence to obtain fourth intermediate data;
and the target synthetic voice acquisition module is used for inputting the fourth intermediate data into the second convolution layer to carry out convolution calculation so as to obtain target synthetic voice.
To achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor implements the speech synthesis method based on diffusion according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, storing a computer program, which when executed by a processor implements the method described in the first aspect.
According to the speech synthesis method, device, equipment and storage medium based on diffusion, a target sentence is obtained and input into a preset acoustic model for acoustic feature extraction to obtain a Mel frequency spectrum; a pre-trained diffusion vocoder is obtained, wherein the diffusion vocoder comprises a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence; a preset time step is acquired and input into the full connection layer to obtain first intermediate data; target audio corresponding to the target sentence is acquired, and the target audio and the Mel frequency spectrum are input into the first convolution layer for convolution calculation to obtain second intermediate data; the first intermediate data, the second intermediate data and the Mel frequency spectrum are added to obtain third intermediate data; the third intermediate data is input into the plurality of residual layers which are connected in sequence to obtain fourth intermediate data; and the fourth intermediate data is input into the second convolution layer for convolution calculation to obtain the target synthesized voice. According to the technical scheme of this embodiment, the parameter quantity used for model training is reduced by introducing the diffusion vocoder, so that the efficiency of speech synthesis is improved.
Drawings
FIG. 1 is a flow chart of steps of a method for speech synthesis based on diffusion according to one embodiment of the present application;
FIG. 2 is a flowchart of steps for training a diffusion vocoder according to another embodiment of the present application;
FIG. 3 is a flowchart of steps for training a diffusion vocoder according to another embodiment of the present application;
FIG. 4 is a flowchart illustrating steps for obtaining fourth intermediate data according to another embodiment of the present application;
FIG. 5 is a schematic block diagram of a speech synthesis apparatus based on diffusion according to another embodiment of the present application;
FIG. 6 is a network configuration diagram of a diffusion vocoder provided in another embodiment of the present application;
FIG. 7 is a schematic diagram of a residual layer provided by another embodiment of the present application;
fig. 8 is a schematic hardware structure of an electronic device according to another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several nouns referred to in this application are parsed:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics; it processes, understands and applies human languages (e.g., Chinese, English, etc.). Natural language processing includes parsing, semantic analysis, chapter understanding, and the like. It is commonly used in the technical fields of machine translation, handwriting and print character recognition, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and the like, and relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.
Information extraction (Information Extraction): a text processing technique that extracts specified types of factual information, such as entities, relations and events, from natural language text and outputs it as structured data. Text data is made up of specific units, such as sentences, paragraphs and chapters, and text information is made up of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, names of persons, names of places and so on from text data is all text information extraction, and the information extracted by text information extraction technology can be of various types.
Based on the above, the embodiments of the present application provide a speech synthesis method, device, equipment and storage medium based on diffusion: a Mel frequency spectrum is obtained by acquiring a target sentence and inputting it into a preset acoustic model for acoustic feature extraction; a pre-trained diffusion vocoder is obtained, wherein the diffusion vocoder comprises a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence; a preset time step is acquired and input into the full connection layer to obtain first intermediate data; target audio corresponding to the target sentence is acquired, and the target audio and the Mel frequency spectrum are input into the first convolution layer for convolution calculation to obtain second intermediate data; the first intermediate data, the second intermediate data and the Mel frequency spectrum are added to obtain third intermediate data; the third intermediate data is input into the plurality of residual layers which are connected in sequence to obtain fourth intermediate data; and the fourth intermediate data is input into the second convolution layer for convolution calculation to obtain the target synthesized voice. According to the technical scheme of the embodiments of the present application, the parameter quantity used for model training is reduced by introducing the diffusion vocoder, so that the efficiency of speech synthesis is improved.
The speech synthesis method, device, equipment and storage medium based on diffusion provided by the embodiments of the present application are specifically described through the following embodiments; the speech synthesis method based on diffusion is described first.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiments of the present application provide a speech synthesis method based on diffusion, relating to the technical field of artificial intelligence. The speech synthesis method based on diffusion provided by the embodiments of the present application can be applied to a terminal, a server side, or software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side may be configured as an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms; the software may be an application that implements the speech synthesis method based on diffusion, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is a step flowchart of a speech synthesis method based on diffusion according to an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S110 to S170.
Step S110, obtaining a target sentence, inputting the target sentence into a preset acoustic model for acoustic feature extraction, and obtaining a Mel frequency spectrum;
step S120, obtaining a pre-trained diffusion vocoder, wherein the diffusion vocoder comprises a full-connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence;
step S130, acquiring a preset time step, and inputting the time step into a full connection layer to obtain first intermediate data;
step S140, obtaining target audio corresponding to the target sentence, and inputting the target audio and the Mel frequency spectrum into a first convolution layer for convolution calculation to obtain second intermediate data;
step S150, adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data;
step S160, inputting the third intermediate data into a plurality of residual layers which are sequentially connected to obtain fourth intermediate data;
step S170, inputting the fourth intermediate data into the second convolution layer for convolution calculation to obtain the target synthesized voice.
It should be noted that, the embodiment of the present application does not limit the specific network structure of the acoustic model, and a person skilled in the art may select an available acoustic model according to actual requirements, and may implement audio conversion on a target sentence, obtain corresponding phoneme features, and extract mel spectrum corresponding to the phoneme features.
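The application does not limit how the Mel frequency spectrum itself is computed. Purely as an illustration of the kind of Mel frequency spectrum the vocoder consumes, for example when preparing the target audio for training, the following torchaudio sketch extracts one from a waveform; every parameter value (sample rate, FFT size, hop length, number of Mel bins) is an assumption, not a value fixed by this application.

```python
import torch
import torchaudio

# Illustrative Mel-spectrum extraction; every parameter value below is an
# assumption, since the application does not fix the acoustic front end.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

waveform = torch.randn(1, 22050)  # placeholder one-second waveform
mel = mel_transform(waveform)     # shape: (1, 80, number_of_frames)
```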
It should be noted that the embodiment of the present application does not limit the specific number of residual layers in the residual block of the diffusion vocoder, and a person skilled in the art may determine it according to actual needs. As shown in fig. 6, which is a network structure diagram of the diffusion vocoder provided in another embodiment of the present application, the diffusion vocoder includes a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block includes 4 residual layers (a first residual layer, a second residual layer, a third residual layer and a fourth residual layer). After the third intermediate data is input to the residual block, the output of each residual layer serves as the input of the next: the intermediate result obtained by inputting the third intermediate data to the first residual layer serves as the input of the second residual layer, the output of the second residual layer serves as the input of the third residual layer, and the output of the third residual layer serves as the input of the fourth residual layer.
It should be noted that, the embodiment of the present application does not limit the specific content of the target sentence, and may be chinese text information or english text information.
It should be noted that the embodiment of the present application does not limit the specific manner of obtaining the target sentence: the target sentence may be obtained by inputting a sentence to be processed into a preset NLP model for semantic recognition, or by extracting text information according to a preset rule; a person skilled in the art may choose according to the actual situation, and this is not limited here.
It should be noted that, before the target sentence is input into the preset acoustic model for acoustic feature extraction to obtain the Mel frequency spectrum, the embodiment of the present application may further include the following steps: obtaining a preset data preprocessing rule, and performing data preprocessing on the target sentence according to the data preprocessing rule. It can be understood that the process of obtaining the Mel frequency spectrum corresponding to the target sentence with the acoustic model first converts the target sentence into phoneme features and then converts the phoneme features into the Mel frequency spectrum; performing data preprocessing on the target sentence before it is converted into phoneme features can remove abnormal information from the target sentence, such as abnormal punctuation marks, thereby improving the usability of the Mel frequency spectrum.
It can be understood that, in steps S110 to S170 illustrated in this embodiment, a target sentence is obtained and input into a preset acoustic model for acoustic feature extraction to obtain a Mel frequency spectrum; a pre-trained diffusion vocoder is obtained, wherein the diffusion vocoder comprises a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence; a preset time step is acquired and input into the full connection layer to obtain first intermediate data; target audio corresponding to the target sentence is acquired, and the target audio and the Mel frequency spectrum are input into the first convolution layer for convolution calculation to obtain second intermediate data; the first intermediate data, the second intermediate data and the Mel frequency spectrum are added to obtain third intermediate data; the third intermediate data is input into the plurality of residual layers which are connected in sequence to obtain fourth intermediate data; and the fourth intermediate data is input into the second convolution layer for convolution calculation to obtain the target synthesized voice. The introduction of the diffusion vocoder reduces the parameter quantity used for model training; compared with the GAN-based and flow-based vocoders in the related art, which suffer from large model parameter quantities, models that are difficult to converge and low speech synthesis efficiency, this can effectively improve the efficiency of speech synthesis.
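As an illustration of the data flow of steps S110 to S170 and of the fig. 6 structure, the following PyTorch sketch wires the four components together. All layer sizes and kernel widths are assumptions, as is the 1x1 projection that lets the Mel frequency spectrum join the addition of step S150; the gated residual layer of fig. 7 is sketched separately after steps S410 to S460 below.

```python
import torch
import torch.nn as nn

class DiffusionVocoderSketch(nn.Module):
    """Skeleton of the fig. 6 structure. All sizes (channel counts, kernel
    widths, number of residual layers) and the 1x1 Mel projection are
    assumptions, not details fixed by this application."""

    def __init__(self, n_mels=80, channels=64, n_layers=4):
        super().__init__()
        self.fc = nn.Linear(1, channels)                      # full connection layer (time step)
        self.first_conv = nn.Conv1d(1 + n_mels, channels, 1)  # first convolution layer
        self.mel_proj = nn.Conv1d(n_mels, channels, 1)        # assumed projection so the Mel
                                                              # spectrum can join the addition in S150
        # Stand-in for the residual block; the gated residual layer of
        # fig. 7 is sketched separately after steps S410 to S460.
        self.residual_block = nn.ModuleList(
            nn.Conv1d(channels, channels, 3, padding=1) for _ in range(n_layers)
        )
        self.second_conv = nn.Conv1d(channels, 1, 1)          # second convolution layer

    def forward(self, audio, mel, t):
        # audio: (B, 1, T); mel: (B, n_mels, T); t: (B, 1) float time step
        first = self.fc(t).unsqueeze(-1)                      # first intermediate data (S130)
        second = self.first_conv(torch.cat([audio, mel], 1))  # second intermediate data (S140)
        x = first + second + self.mel_proj(mel)               # third intermediate data (S150)
        for layer in self.residual_block:
            x = x + torch.relu(layer(x))                      # fourth intermediate data (S160)
        return self.second_conv(x)                            # target synthesized voice (S170)
```

A call such as `DiffusionVocoderSketch()(audio, mel, t)` with audio of shape (B, 1, T) and mel of shape (B, 80, T) then reproduces the first to fourth intermediate data of steps S130 to S160 before the second convolution layer produces the output.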
In some embodiments, the target audio includes initial speech distribution information, and referring to fig. 2, the training step of the diffusion vocoder may include, but is not limited to, steps S210 to S240:
step S210, acquiring a preset Markov chain;
step S220, inputting the initial voice distribution information and the Mel frequency spectrum into a Markov chain for data conversion to obtain Gaussian noise distribution information;
step S230, gaussian noise distribution information and a Mel frequency spectrum are input into a Markov chain for data conversion, and target voice distribution information is obtained;
step S240, training the diffusion vocoder according to the target voice distribution information, the Mel frequency spectrum and the time step.
It can be understood that the diffusion vocoder can be regarded as a diffusion probability model, and the diffusion probability model is based on a Markov chain. The step of inputting the initial speech distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain the Gaussian noise distribution information corresponds to the diffusion process of the Markov chain: the joint probability distribution of the diffusion process is calculated according to the target audio, the time step and the Mel frequency spectrum, thereby realizing the conversion from the initial speech distribution information to the Gaussian noise distribution information. The realization formula can be as follows:

$$q(x_{1:T} \mid x_0, mel) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}, mel)$$

wherein x_0 is the initial speech distribution information, mel is the Mel frequency spectrum, x_t is the hidden variable of the current time step, x_T is the Gaussian noise distribution information, t is the time step, and q(x_t | x_{t-1}, mel) is expressed as follows:

$$q(x_t \mid x_{t-1}, mel) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

wherein β_t is a constant, I is the first standard normal distribution information corresponding to the initial speech distribution information x_0, √(1-β_t)·x_{t-1} is the mean of the Gaussian noise distribution information, and β_t·I is its variance.
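A minimal sketch of this diffusion process follows, assuming a linear β_t schedule; the application only states that β_t is constant per step, so the schedule values are illustrative.

```python
import torch

T = 50                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.05, T)    # assumed linear beta_t schedule

def diffuse(x0):
    """Forward process: x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    x = x0
    trajectory = [x0]
    for beta_t in betas:
        mean = torch.sqrt(1.0 - beta_t) * x          # mean of the Gaussian transition
        x = mean + torch.sqrt(beta_t) * torch.randn_like(x)
        trajectory.append(x)
    return trajectory                                # trajectory[-1] approximates pure noise

x0 = torch.randn(1, 1, 22050)  # placeholder for the initial speech distribution information
xs = diffuse(x0)
```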
It can be understood that the step of inputting the Gaussian noise distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain the target voice distribution information corresponds to the reverse of the diffusion process of the Markov chain. In this embodiment, the joint probability distribution of the reverse process is calculated according to the target audio, the time step and the Mel frequency spectrum, thereby realizing the decoding from the Gaussian noise distribution information to the target voice distribution information. The realization formula can be as follows:

$$p_\theta(x_{0:T} \mid mel) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, mel)$$

wherein x_0 is the target voice distribution information, mel is the Mel frequency spectrum, t is the time step, and p_θ(x_{t-1} | x_t, mel) is determined according to the following formula:

$$p_\theta(x_{t-1} \mid x_t, mel) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)^2 I\right)$$

wherein μ_θ and σ_θ are preset models to be trained and t is the time step.
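A sketch of the corresponding reverse-process sampling loop is given below; `vocoder` stands for the trained network ε_θ, and the closed form used for μ_θ together with the choice σ_θ = √β_t follows the usual DDPM parameterisation, which is an assumption about this application rather than something it states.

```python
import torch

@torch.no_grad()
def sample(vocoder, mel, betas):
    """Reverse process: x_{t-1} ~ N(mu_theta(x_t, t), sigma_theta(x_t, t)^2 * I)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(mel.size(0), 1, mel.size(-1))   # x_T: Gaussian noise distribution information
    for t in reversed(range(len(betas))):
        eps = vocoder(x, mel, torch.full((x.size(0), 1), float(t)))
        # Assumed DDPM-style closed form for mu_theta, with sigma_theta = sqrt(beta_t).
        mu = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mu + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mu
    return x                                        # target voice distribution information
```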
In some embodiments, referring to fig. 3, step S240 may include, but is not limited to, steps S310 to S330:
Step S310, obtaining a preset initial diffusion vocoder;
step S320, constructing a target loss function according to the target voice distribution information, the Mel frequency spectrum and the time step;
step S330, training the initial vocoder according to the target loss function to obtain the vocoder.
It should be noted that the target loss function is determined according to the following formula:

$$L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t,\ mel\right)\right\rVert^2\right]$$

wherein mel is the Mel frequency spectrum, x_0 is the target voice distribution information, ε is the second standard normal distribution information corresponding to x_0, ε_θ is the diffusion vocoder, t is the time step, and ᾱ_t and α_t are determined according to the following formulas:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \alpha_t = 1 - \beta_t$$

wherein β_t is a constant.
It will be appreciated that the initial diffusion vocoder is trained by a target loss function constructed from target speech distribution information, mel spectrum and time step, thereby achieving optimization of the initial diffusion vocoder.
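A sketch of one training step under this target loss function follows; the batch shapes, the optimiser interface and the uniform sampling of the time step are assumptions.

```python
import torch

def training_step(vocoder, optimizer, x0, mel, betas):
    """One optimisation step on || eps - eps_theta(sqrt(a_bar_t) x0 + sqrt(1 - a_bar_t) eps, t, mel) ||^2."""
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(len(betas), (x0.size(0),))    # assumed uniform time-step sampling
    a_bar = alpha_bars[t].view(-1, 1, 1)
    eps = torch.randn_like(x0)                      # second standard normal distribution information
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    eps_pred = vocoder(x_t, mel, t.float().unsqueeze(1))
    loss = torch.mean((eps - eps_pred) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Wrapping this in a loop over (x0, mel) pairs with a torch.optim optimizer corresponds to the training of steps S310 to S330.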
In some embodiments, each residual layer includes a third convolution layer, a fourth convolution layer, a tanh layer, and a Sigmoid layer, referring to fig. 4, step S160 may include, but is not limited to, steps S410 through S460:
step S410, inputting the third intermediate data into a third convolution layer for convolution calculation to obtain fifth intermediate data;
Step S420, inputting third intermediate data into the tanh layer for data activation processing to obtain first activation data;
step S430, inputting the third intermediate data to the Sigmoid layer for data activation processing to obtain second activation data;
step S440, multiplying the first activation data and the second activation data to obtain target activation data;
step S450, inputting target activation data into a fourth convolution layer to perform convolution calculation to obtain sixth intermediate data;
step S460, performing addition processing on the fifth intermediate data and the sixth intermediate data to obtain fourth intermediate data.
It should be noted that the embodiment of the present application does not limit the specific structure of each residual layer in the residual block. As shown in fig. 7, which is a schematic diagram of the residual layer provided in another embodiment of the present application, the residual layer in this embodiment consists of a bidirectional dilated convolution structure, namely a third convolution layer and a fourth convolution layer, and two activation function layers, namely a tanh layer and a Sigmoid layer. Based on the residual layer structure provided in this embodiment, the specific calculation steps for obtaining the fourth intermediate data after the third intermediate data is input into the residual layer are as follows: inputting the third intermediate data into the third convolution layer for convolution calculation to obtain fifth intermediate data; inputting the third intermediate data into the tanh layer for data activation processing to obtain first activation data; inputting the third intermediate data into the Sigmoid layer for data activation processing to obtain second activation data; multiplying the first activation data and the second activation data to obtain target activation data; inputting the target activation data into the fourth convolution layer for convolution calculation to obtain sixth intermediate data; and adding the fifth intermediate data and the sixth intermediate data to obtain the fourth intermediate data.
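The following sketch mirrors steps S410 to S460 for a single residual layer; the channel count, kernel size and dilation value are assumptions.

```python
import torch
import torch.nn as nn

class ResidualLayerSketch(nn.Module):
    """Gated residual layer mirroring steps S410-S460; channel count,
    kernel size and dilation are assumptions."""

    def __init__(self, channels=64, dilation=1):
        super().__init__()
        self.third_conv = nn.Conv1d(channels, channels, 3,
                                    padding=dilation, dilation=dilation)
        self.fourth_conv = nn.Conv1d(channels, channels, 1)

    def forward(self, third):
        fifth = self.third_conv(third)        # S410: fifth intermediate data
        first_act = torch.tanh(third)         # S420: first activation data
        second_act = torch.sigmoid(third)     # S430: second activation data
        target_act = first_act * second_act   # S440: target activation data
        sixth = self.fourth_conv(target_act)  # S450: sixth intermediate data
        return fifth + sixth                  # S460: fourth intermediate data
```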
Referring to fig. 5, the embodiment of the present application further provides a speech synthesis apparatus based on diffusion, which may implement the above speech synthesis method based on diffusion, where the speech synthesis apparatus 500 based on diffusion includes:
the acoustic feature conversion module 510 is configured to obtain a target sentence, input the target sentence into a preset acoustic model, and perform data conversion processing to obtain a mel frequency spectrum;
the model obtaining module 520 is configured to obtain a pre-trained diffration vocoder, where the diffration vocoder includes a full connection layer, a first convolution layer, a second convolution layer, and a residual block, and the residual block includes a plurality of residual layers that are sequentially connected;
the first data processing module 530 is configured to obtain a preset time step, and input the time step to the full connection layer to obtain first intermediate data;
the second data processing module 540 is configured to obtain target audio corresponding to the target sentence, input the target audio and mel spectrum to the first convolution layer for convolution calculation, and obtain second intermediate data;
a third data processing module 550, configured to perform an addition process on the first intermediate data, the second intermediate data, and the mel spectrum, to obtain third intermediate data;
a fourth data processing module 560, configured to input third intermediate data to a plurality of residual layers that are sequentially connected to obtain fourth intermediate data;
The target synthesized speech acquisition module 570 is configured to input the fourth intermediate data to the second convolution layer for performing convolution calculation, so as to obtain a target synthesized speech.
The specific implementation of the speech synthesis apparatus based on diffusion is substantially the same as that of the speech synthesis method based on diffusion described above, and will not be repeated here.
The embodiment of the present application also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the above speech synthesis method based on diffusion when executing the computer program. The electronic device can be any intelligent terminal, including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 8, fig. 8 is a schematic hardware structure of an electronic device according to another embodiment of the present application, where the electronic device includes:
the processor 810 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 820 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 820 may store an operating system and other application programs. When the technical solutions provided by the embodiments of the present disclosure are implemented by software or firmware, the relevant program codes are stored in the memory 820, and the processor 810 invokes them to perform the speech synthesis method based on diffusion of the embodiments of the present disclosure, for example, method steps S110 to S170 in fig. 1, method steps S210 to S240 in fig. 2, method steps S310 to S330 in fig. 3, and method steps S410 to S460 in fig. 4 described above;
An input/output interface 830 for implementing information input and output;
the communication interface 840 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
bus 850 transfers information between the various components of the device (e.g., processor 810, memory 820, input/output interface 830, and communication interface 840);
wherein the processor 810, the memory 820, the input/output interface 830 and the communication interface 840 enable communication connections with each other within the device via the bus 850.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor or controller, for example by the processor 810 in the embodiment of the electronic device 800, cause the processor to perform the speech synthesis method based on diffusion in the above embodiments, for example, method steps S110 to S170 in fig. 1, method steps S210 to S240 in fig. 2, method steps S310 to S330 in fig. 3, and method steps S410 to S460 in fig. 4 described above.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the speech synthesis method, device, equipment and storage medium based on diffusion, a target sentence is obtained and input into a preset acoustic model for acoustic feature extraction to obtain a Mel frequency spectrum; a pre-trained diffusion vocoder is obtained, wherein the diffusion vocoder comprises a full connection layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence; a preset time step is acquired and input into the full connection layer to obtain first intermediate data; target audio corresponding to the target sentence is acquired, and the target audio and the Mel frequency spectrum are input into the first convolution layer for convolution calculation to obtain second intermediate data; the first intermediate data, the second intermediate data and the Mel frequency spectrum are added to obtain third intermediate data; the third intermediate data is input into the plurality of residual layers which are connected in sequence to obtain fourth intermediate data; and the fourth intermediate data is input into the second convolution layer for convolution calculation to obtain the target synthesized voice. According to the technical scheme of the embodiments of the present application, the parameter quantity used for model training is reduced by introducing the diffusion vocoder, so that the efficiency of speech synthesis is improved.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The present embodiments are operational with numerous general purpose or special purpose computer device environments or configurations. For example: personal computers, server computers, hand-held or portable electronic devices, tablet electronic devices, multiprocessor devices, microprocessor-based devices, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above devices or electronic devices, and the like. The application may be described in the general context of computer programs, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing electronic devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The units involved in the embodiments of the present application may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
It should be noted that although in the above detailed description several modules or units of an electronic device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing electronic device (may be a personal computer, a server, a touch terminal, or a network electronic device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (10)

1. A method of speech synthesis based on diffusion, the method comprising:
acquiring a target sentence, inputting the target sentence into a preset acoustic model for acoustic feature extraction, and obtaining a Mel frequency spectrum;
obtaining a pre-trained diffusion vocoder, wherein the diffusion vocoder comprises a fully connected layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence;
acquiring a preset time step, and inputting the time step into the fully connected layer to obtain first intermediate data;
acquiring target audio corresponding to the target sentence, and inputting the target audio and the Mel frequency spectrum into the first convolution layer for convolution calculation to obtain second intermediate data;
adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data;
inputting the third intermediate data into a plurality of residual layers which are connected in sequence to obtain fourth intermediate data;
and inputting the fourth intermediate data into the second convolution layer to perform convolution calculation to obtain target synthesized speech.
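For readers tracing the data flow recited in claim 1 (and the residual-layer internals of claim 7 below), the following PyTorch sketch mirrors the claimed steps. It is a minimal illustration, not the patented implementation: the class names, channel counts, kernel sizes, the scalar encoding of the time step, and the assumption that the Mel spectrum has already been upsampled to the audio length and projected to the model's channel width are all assumptions the claims do not fix.

```python
# Illustrative sketch only; layer sizes, names, and the pre-projected Mel input
# are assumptions, not details fixed by the claims.
import torch
import torch.nn as nn


class ResidualLayer(nn.Module):
    """One residual layer following the gated structure of claim 7."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv1d(channels, channels, 3, padding=1)  # "third convolution layer"
        self.conv4 = nn.Conv1d(channels, channels, 3, padding=1)  # "fourth convolution layer"

    def forward(self, third: torch.Tensor) -> torch.Tensor:
        fifth = self.conv3(third)                         # fifth intermediate data
        gated = torch.tanh(third) * torch.sigmoid(third)  # tanh and Sigmoid activations, multiplied
        sixth = self.conv4(gated)                         # sixth intermediate data
        return fifth + sixth                              # summed, as in claim 7


class DiffusionVocoder(nn.Module):
    def __init__(self, channels: int = 64, num_layers: int = 4):
        super().__init__()
        self.fc = nn.Linear(1, channels)                              # fully connected layer (time step)
        self.conv1 = nn.Conv1d(1 + channels, channels, 3, padding=1)  # first convolution layer
        self.residual = nn.Sequential(*[ResidualLayer(channels) for _ in range(num_layers)])
        self.conv2 = nn.Conv1d(channels, 1, 3, padding=1)             # second convolution layer

    def forward(self, t: torch.Tensor, audio: torch.Tensor, mel: torch.Tensor) -> torch.Tensor:
        # t: (B, 1); audio: (B, 1, L); mel: (B, channels, L), assumed pre-projected.
        first = self.fc(t).unsqueeze(-1)                     # first intermediate data, broadcast over L
        second = self.conv1(torch.cat([audio, mel], dim=1))  # second intermediate data
        third = first + second + mel                         # third intermediate data
        fourth = self.residual(third)                        # fourth intermediate data
        return self.conv2(fourth)                            # target synthesized speech (one pass)
```

In the diffusion setting this network is evaluated once per step of the reverse process described in claims 2 to 4; the single forward pass above only traces the data flow that claim 1 recites.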
2. The method of claim 1, wherein the target audio includes initial speech distribution information, and the diffusion vocoder is trained by:
acquiring a preset Markov chain;
inputting the initial speech distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain Gaussian noise distribution information;
inputting the Gaussian noise distribution information and the Mel frequency spectrum into the Markov chain for data conversion to obtain target speech distribution information;
and training the diffusion vocoder according to the target speech distribution information, the Mel frequency spectrum and the time step.
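The two conversions of claim 2, and the formulas of claims 3, 4 and 6 below, all depend on a noise schedule $\beta_t$ and the derived quantities $\alpha_t$ and $\bar{\alpha}_t$. A minimal helper is sketched here; the linear shape of the schedule, the step count, and the endpoint values are assumptions, since the claims leave the schedule unspecified beyond $\beta_t$ being a constant per step.

```python
# A minimal noise-schedule helper; the linear shape and endpoints are assumptions.
import torch


def make_schedule(num_steps: int = 50, beta_start: float = 1e-4, beta_end: float = 0.05):
    """Return (beta, alpha, alpha_bar) tensors of length num_steps."""
    beta = torch.linspace(beta_start, beta_end, num_steps)
    alpha = 1.0 - beta                       # alpha_t = 1 - beta_t (claim 6)
    alpha_bar = torch.cumprod(alpha, dim=0)  # alpha_bar_t = product of alpha_s for s <= t
    return beta, alpha, alpha_bar
```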
3. The speech synthesis method according to claim 2, wherein the initial speech distribution information and the Mel frequency spectrum are input into the Markov chain for data conversion to obtain the Gaussian noise distribution information, and the Gaussian noise distribution information is determined according to the following formula:

$$q(x_1, \dots, x_T \mid x_0, mel) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}, mel)$$

wherein $x_0$ is the initial speech distribution information, $mel$ is the Mel frequency spectrum, $x_t$ is the hidden variable of the current time step, $x_T$ is the Gaussian noise distribution information, $t$ is the time step, and $q(x_t \mid x_{t-1}, mel)$ is determined according to the following formula:

$$q(x_t \mid x_{t-1}, mel) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

wherein $\beta_t$ is a constant, $I$ is the first standard normal distribution information corresponding to the initial speech distribution information $x_0$, $\sqrt{1-\beta_t}\, x_{t-1}$ is the mean of the Gaussian noise distribution information, and $\beta_t I$ is the variance of the Gaussian noise distribution information.
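A sketch of the claim-3 forward conversion, drawing each transition from $\mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I)$ in turn. In this reading the Mel spectrum conditions the chain but does not alter the Gaussian transition itself, an assumption consistent with the formula above.

```python
# Forward (noising) chain of claim 3; purely illustrative.
import torch


def forward_diffuse(x0: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Map initial speech x0 to (approximately) Gaussian noise x_T."""
    x = x0
    for beta_t in beta:
        eps = torch.randn_like(x)  # standard normal sample, the "I" term
        x = torch.sqrt(1.0 - beta_t) * x + torch.sqrt(beta_t) * eps
    return x
```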
4. The speech synthesis method according to claim 2, wherein the Gaussian noise distribution information and the Mel frequency spectrum are input into the Markov chain for data conversion to obtain the target speech distribution information, and the target speech distribution information is determined according to the following formula:

$$p_\theta(x_0, \dots, x_{T-1} \mid x_T, mel) = \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, mel)$$

wherein $x_0$ is the target speech distribution information, $mel$ is the Mel frequency spectrum, $t$ is the time step, and $p_\theta(x_{t-1} \mid x_t, mel)$ is determined according to the following formula:

$$p_\theta(x_{t-1} \mid x_t, mel) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_\theta(x_t, t)^2 I\right)$$

wherein $\mu_\theta$ and $\sigma_\theta$ are preset models to be trained, and $t$ is the time step.
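The claim-4 reverse conversion can be sketched as a sampling loop running from $t = T$ down to $t = 1$, where a callable stands in for the trained $\mu_\theta$ and $\sigma_\theta$ models. The callable's signature and the common convention of adding no noise at the final step are assumptions.

```python
# Reverse (denoising) chain of claim 4; mu_sigma stands in for the trained models.
import torch
from typing import Callable, Tuple


def reverse_sample(x_T: torch.Tensor,
                   mu_sigma: Callable[[torch.Tensor, int], Tuple[torch.Tensor, torch.Tensor]],
                   num_steps: int) -> torch.Tensor:
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_theta(x_t, t)^2 I) down to x_0."""
    x = x_T
    for t in range(num_steps, 0, -1):
        mu, sigma = mu_sigma(x, t)                 # learned mean and std for this step
        z = torch.randn_like(x) if t > 1 else 0.0  # no noise on the final step (convention)
        x = mu + sigma * z
    return x                                       # target speech distribution information
```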
5. The method of claim 2, wherein the training of the diffusion vocoder according to the target speech distribution information, the Mel frequency spectrum and the time step comprises:
acquiring a preset initial diffusion vocoder;
constructing a target loss function according to the target speech distribution information, the Mel frequency spectrum and the time step;
and training the initial diffusion vocoder according to the target loss function to obtain the diffusion vocoder.
6. The method of speech synthesis based on diffusion according to claim 5, wherein the target loss function is determined according to the following formula:

$$\min_\theta L(\theta) = \mathbb{E}_{x_0, \epsilon, t}\left[\left\lVert \epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ mel,\ t\right)\right\rVert^2\right]$$

wherein $mel$ is the Mel frequency spectrum, $x_0$ is the original speech distribution information, $\epsilon$ is the second standard normal distribution information corresponding to $x_0$, $\epsilon_\theta$ is the diffusion vocoder, $t$ is the time step, and $\bar{\alpha}_t$ and $\alpha_t$ are determined according to the following formulas:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s, \qquad \alpha_t = 1 - \beta_t$$

wherein $\beta_t$ is a constant.
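Combining claims 5 and 6, one optimization step draws a random time step, forms the noised sample in closed form from $\bar{\alpha}_t$, and minimizes the squared error between the injected noise $\epsilon$ and the vocoder's prediction $\epsilon_\theta$. In this sketch the argument order of model(x_t, mel, t) and the use of a batch-mean squared error are assumptions.

```python
# One training step on the claim-6 objective; a sketch, not the patented code.
import torch


def training_step(model, optimizer, x0, mel, alpha_bar):
    """Minimize ||eps - eps_theta(sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps, mel, t)||^2."""
    t = torch.randint(0, len(alpha_bar), (1,)).item()  # random time step
    eps = torch.randn_like(x0)                         # second standard normal information
    a_bar = alpha_bar[t]
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # closed-form forward sample
    loss = ((eps - model(x_t, mel, t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```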
7. The speech synthesis method according to claim 1, wherein each residual layer includes a third convolution layer, a fourth convolution layer, a tanh layer and a Sigmoid layer, and the inputting of the third intermediate data into the plurality of residual layers which are connected in sequence to obtain fourth intermediate data comprises:
inputting the third intermediate data into the third convolution layer to perform convolution calculation to obtain fifth intermediate data;
inputting the third intermediate data into the tanh layer for data activation processing to obtain first activation data;
inputting the third intermediate data to the Sigmoid layer for data activation processing to obtain second activation data;
multiplying the first activation data and the second activation data to obtain target activation data;
inputting the target activation data into the fourth convolution layer to perform convolution calculation to obtain sixth intermediate data;
and adding the fifth intermediate data and the sixth intermediate data to obtain the fourth intermediate data.
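Claim 7's gated residual computation can also be restated functionally, tying each recited intermediate quantity to one line; the tensor shapes and kernel sizes are, again, assumptions.

```python
# Functional restatement of the claim-7 data flow on a toy tensor.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
third = torch.randn(1, 8, 16)  # third intermediate data: (batch, channels, length)
w3 = torch.randn(8, 8, 3)      # weights of the "third convolution layer"
w4 = torch.randn(8, 8, 3)      # weights of the "fourth convolution layer"

fifth = F.conv1d(third, w3, padding=1)            # fifth intermediate data
gated = torch.tanh(third) * torch.sigmoid(third)  # first and second activation data, multiplied
sixth = F.conv1d(gated, w4, padding=1)            # sixth intermediate data
fourth = fifth + sixth                            # fourth intermediate data
print(fourth.shape)                               # torch.Size([1, 8, 16])
```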
8. A speech synthesis apparatus based on diffusion, the apparatus comprising:
the acoustic feature conversion module is used for acquiring a target sentence, inputting the target sentence into a preset acoustic model for data conversion processing, and obtaining a Mel frequency spectrum;
the model acquisition module is used for acquiring a pre-trained diffusion vocoder, wherein the diffusion vocoder comprises a fully connected layer, a first convolution layer, a second convolution layer and a residual block, and the residual block comprises a plurality of residual layers which are connected in sequence;
the first data processing module is used for acquiring a preset time step, and inputting the time step into the fully connected layer to obtain first intermediate data;
the second data processing module is used for acquiring target audio corresponding to the target sentence, inputting the target audio and the Mel frequency spectrum into the first convolution layer for convolution calculation, and obtaining second intermediate data;
the third data processing module is used for adding the first intermediate data, the second intermediate data and the Mel frequency spectrum to obtain third intermediate data;
the fourth data processing module is used for inputting the third intermediate data into a plurality of residual layers which are connected in sequence to obtain fourth intermediate data;
and the target synthesized speech acquisition module is used for inputting the fourth intermediate data into the second convolution layer to carry out convolution calculation, so as to obtain target synthesized speech.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the speech synthesis method based on diffusion of any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method based on diffusion of any one of claims 1 to 7.
CN202310394113.3A 2023-04-07 2023-04-07 Voice synthesis method, device, equipment and storage medium based on diffusion Pending CN116364054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310394113.3A CN116364054A (en) 2023-04-07 2023-04-07 Voice synthesis method, device, equipment and storage medium based on diffusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310394113.3A CN116364054A (en) 2023-04-07 2023-04-07 Voice synthesis method, device, equipment and storage medium based on diffusion

Publications (1)

Publication Number Publication Date
CN116364054A 2023-06-30

Family

ID=86920990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310394113.3A Pending CN116364054A (en) 2023-04-07 2023-04-07 Voice synthesis method, device, equipment and storage medium based on diffusion

Country Status (1)

Country Link
CN (1) CN116364054A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292672A (en) * 2023-11-27 2023-12-26 厦门大学 High-quality speech synthesis method based on correction flow model
CN117292672B (en) * 2023-11-27 2024-01-30 厦门大学 High-quality speech synthesis method based on correction flow model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination