CN117292024A - Voice-based image generation method and device, medium and electronic equipment - Google Patents

Voice-based image generation method and device, medium and electronic equipment

Info

Publication number
CN117292024A
Authority
CN
China
Prior art keywords
vector
connection
voice
embedded
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311580355.8A
Other languages
Chinese (zh)
Other versions
CN117292024B (en)
Inventor
孔欧 (Kong Ou)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mido Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mido Technology Co ltd filed Critical Shanghai Mido Technology Co ltd
Priority to CN202311580355.8A priority Critical patent/CN117292024B/en
Publication of CN117292024A publication Critical patent/CN117292024A/en
Application granted granted Critical
Publication of CN117292024B publication Critical patent/CN117292024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/165 - Management of the audio stream, e.g. setting of volume, audio stream path
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20212 - Image combination
    • G06T 2207/20221 - Image fusion; Image merging
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The application provides a voice-based image generation method, together with a corresponding device, medium, and electronic device. The image generation method comprises the following steps: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector through a second decoder to obtain the target image. The image generation method reduces the complexity of the overall image generation process and improves the accuracy of the generated target image.

Description

Voice-based image generation method and device, medium and electronic equipment
Technical Field
The present application relates to voice-based image generation, and in particular to a voice-based image generation method, apparatus, medium, and electronic device.
Background
Speech recognition has developed rapidly in recent years and is now widely used in people's daily life and production, for example for generating text records from speech or recognizing operation instructions.
At present, voice-based image generation is usually realized by cascading tasks, for example an ASR (Automatic Speech Recognition) + Text2Image (Text-to-Image) pipeline. Because such a method requires at least two models, it suffers from high complexity of the overall image generation process and low accuracy of the generated images.
Disclosure of Invention
The aim of the present application is to provide a voice-based image generation method, device, medium, and electronic device to solve the problems of high complexity of the overall image generation process and low accuracy of the generated images in existing image generation methods.
In a first aspect, the present application provides a voice-based image generation method, the image generation method including: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to acquire the target image.
In the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector includes:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than the preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is the updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the speech through the deep learning model to obtain the noise vector includes: performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector; and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder so as to acquire the noise vector.
In an embodiment of the present application, the implementation method for obtaining the fusion vector by performing a first cross fusion process on the connection vector and the embedded vector of the speech through the stacked encoders includes: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
In an embodiment of the present application, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector by using the stacked first decoder includes: performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
In an embodiment of the present application, the denoising vector of the connection vector is a vector obtained by subtracting the noise vector from the connection vector.
In an embodiment of the present application, the second decoder is the decoder of a variational autoencoder.
In a second aspect, the present application provides a voice-based image generation apparatus, comprising: a voice acquisition module for acquiring voice for generating a target image; a voice processing module for processing the voice through an audio encoder to acquire an embedded vector of the voice; a sample acquisition module for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; a connection processing module for performing connection processing on the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; a noise acquisition module for processing the connection vector and the embedded vector of the voice through a deep learning model to acquire a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and an image acquisition module for processing the denoising vector of the connection vector through a second decoder to acquire the target image.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image generation method according to any of the first aspects of the present application.
In a fourth aspect, the present application provides an electronic device, including: a memory storing a computer program; and a processor, communicatively connected to the memory, which executes the image generation method according to any one of the first aspect of the application when the computer program is invoked.
As described above, the voice-based image generation method, device, medium, and electronic device of the present application have the following beneficial effects:
in the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
Drawings
Fig. 1 is a schematic diagram of a hardware structure for running the image generating method according to the embodiment of the present application.
Fig. 2 is a flowchart of an image generating method according to an embodiment of the present application.
Fig. 3 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector according to an embodiment of the present application.
Fig. 4 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector according to an embodiment of the present application.
Fig. 5 is a schematic process diagram of an image generating method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application.
Reference numerals: 10 electronic device; 110 memory; 120 processor; 1210 central processing unit; 1220 neural network processor; 12210 neural network implementation engine; 12220 dedicated hardware circuit; 122210 matrix computation unit; 122220 vector computation unit; 600 voice-based image generation apparatus; 610 voice acquisition module; 620 voice processing module; 630 sample acquisition module; 640 connection processing module; 650 noise acquisition module; 660 image acquisition module; S11-S16 steps; S1-S2 steps; S21-S22 steps.
Detailed Description
Other advantages and effects of the present application will become readily apparent to those skilled in the art from the disclosure in this specification, taken together with the following description of the embodiments and the accompanying drawings. The application may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed from different viewpoints and for different applications without departing from the spirit of the application. It should be noted that the following embodiments, and the features within them, may be combined with one another provided there is no conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the application in a schematic way; the drawings show only the components related to the application rather than the number, shape, and size of components in an actual implementation. In practice, the form, quantity, and proportions of the components may vary arbitrarily, and their layout may be more complex.
The following describes the technical solutions in the embodiments of the present application in detail with reference to the drawings in the embodiments of the present application.
The image generation method provided by the embodiments of the present application can run on an electronic device. Fig. 1 is a block diagram of the hardware structure of such an electronic device. The electronic device 10 comprises a memory 110 and a processor 120; the processor 120 may be a central processing unit 1210 or a dedicated neural network processor 1220, where the neural network processor 1220 comprises a neural network implementation engine 12210 and dedicated hardware circuits 12220, and the dedicated hardware circuits 12220 comprise a matrix computation unit 122210 and a vector computation unit 122220.
Optionally, the neural network processor 1220 is a processor that performs neural network computation using the dedicated hardware circuit 12220, which is an integrated circuit for neural network computation and includes a matrix computation unit 122210 and a vector computation unit 122220 that perform vector-matrix multiplication in hardware.
Optionally, the neural network implementation engine 12210 is configured to generate instructions for execution by the dedicated hardware circuit 12220, which when executed by the dedicated hardware circuit 12220, cause the dedicated hardware circuit 12220 to perform operations specified by a neural network to generate a neural network output from the received neural network input.
As shown in fig. 2, the present embodiment provides a voice-based image generation method, which may be implemented by a processor of a computer device, the image generation method including:
S11, acquiring voice for generating a target image.
S12, processing the voice through an audio encoder to obtain an embedded vector of the voice.
Optionally, the audio encoder may be a model based on a deep convolutional neural network or a recurrent neural network. The embedded vector of the voice may be a low-dimensional vector that captures the semantic and acoustic features of the audio. The audio encoder may be, for example, wav2vec 2.0, a speech representation learning model based on self-supervised learning that converts speech signals into high-quality speech representations, i.e., encodes waveforms for speech recognition, speech understanding, and other speech-related tasks.
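As a concrete illustration of step S12, the following minimal sketch extracts a speech embedding with wav2vec 2.0 via the HuggingFace transformers library; the library, the checkpoint name "facebook/wav2vec2-base-960h", and the tensor shapes are assumptions for illustration and are not prescribed by this embodiment.

```python
# Hypothetical sketch of step S12: library, checkpoint, and shapes are assumed.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def speech_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Map a mono waveform (1-D tensor) to a sequence of speech embeddings."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.squeeze(0)                          # (frames, 768)
```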
S13, obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution.
Optionally, the embedded vector of the sample may refer to a representation of the sample mapped into a vector space. The sample may be drawn at random from a Gaussian distribution.
S14, connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector.
Optionally, the connection processing may refer to joining the embedded vector of the voice and the embedded vector of the sample; specifically, the connection vector may be formed by concatenating the two embedded vectors along a specific dimension.
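A minimal sketch of steps S13 and S14 follows, assuming PyTorch tensors; the shapes and the concatenation axis are illustrative, since the embodiment only states that the two embedded vectors are joined along a specific dimension.

```python
import torch

frames, dim = 50, 768                   # assumed shapes, matching the sketch above
speech_emb = torch.randn(frames, dim)   # stand-in for the audio encoder output
sample_emb = torch.randn(frames, dim)   # S13: random sample from a Gaussian distribution
connection = torch.cat([speech_emb, sample_emb], dim=-1)  # S14: connection vector, (50, 1536)
```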
S15, processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model consists of an encoder and a first decoder based on a cross attention mechanism.
Optionally, the noise vector may refer to a noise vector relative to the connection vector, and the denoising vector may refer to the connection vector with the noise removed. The cross-attention mechanism is a variant of the attention mechanism for handling associations and interactions between multiple input sequences; it is commonly used in natural language processing and computer vision tasks, particularly in sequence-to-sequence models and attention models.
Optionally, the deep learning model may be a U2Net model.
S16, processing the denoising vector of the connection vector through a second decoder to acquire the target image.
Optionally, the first decoder and the second decoder may be different types of decoders. The second decoder may be the decoder of a variational autoencoder, i.e., a VAE decoder.
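As an illustration of step S16, the sketch below decodes a denoised latent with the decoder of a pretrained VAE from the diffusers library; the library, the checkpoint "stabilityai/sd-vae-ft-mse", and the latent shape are assumptions, since the embodiment only requires the decoder of a variational autoencoder.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # assumed checkpoint

def decode_to_image(denoised_latent: torch.Tensor) -> torch.Tensor:
    """Decode a (1, 4, 64, 64) latent, reshaped from the denoising vector, to an image."""
    with torch.no_grad():
        image = vae.decode(denoised_latent).sample  # (1, 3, 512, 512), values in [-1, 1]
    return (image.clamp(-1, 1) + 1) / 2             # rescale to [0, 1] for display
```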
As can be seen from the above description, the image generation method according to the present embodiment includes: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to obtain the target image.
In the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
As shown in fig. 3, the present embodiment provides an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, including:
s1: and processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector.
Optionally, the denoising vector of the connection vector is the vector obtained by subtracting the noise vector from the connection vector.
S2: if the current iteration number is smaller than the preset iteration number, turning to S1, adding 1 to the value of the current iteration number, wherein the connection vector in S1 is an updated connection vector, the updated connection vector is a denoising vector of the connection vector, and otherwise, acquiring the denoising vector of the connection vector.
Optionally, the preset iteration number may be set flexibly according to the actual situation; for example, it may be set to 500. The initial value of the current iteration number may be 1.
Optionally, the noise vector is a noise vector relative to the connection vector. When going from S2 back to S1, the connection vector in S1 is the updated connection vector, the noise vector in S1 is the noise vector relative to the updated connection vector, and the denoising vector in S1 may refer to the vector obtained by subtracting the noise vector from the updated connection vector. For example, if the connection vector is u and the noise vector is v, the denoising vector of the connection vector is u-v.
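In code, the S1/S2 loop might look as follows; this is a sketch following the u-v example above, with `model` standing in for the deep learning model described below and 500 as the preset iteration number.

```python
def iterative_denoise(model, connection, speech_emb, num_iters=500):
    """Sketch of S1/S2: repeatedly predict and subtract noise from the connection vector."""
    u = connection
    for _ in range(num_iters):       # S2: iterate until the preset count is reached
        v = model(u, speech_emb)     # S1: noise vector for the current (updated) u
        u = u - v                    # denoising vector u - v becomes the new u
    return u                         # denoising vector with the noise removed
```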
As can be seen from the above description, the implementation method of processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector according to the present embodiment includes: S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector; S2: if the current iteration number is smaller than the preset iteration number, returning to S1, where the connection vector in S1 is the updated connection vector, i.e., the denoising vector of the connection vector; otherwise, outputting the denoising vector of the connection vector.
In this method, by setting a preset iteration number and repeatedly executing S1 and S2, a connection vector with the noise completely removed is obtained, which improves the accuracy of the finally generated target image.
As shown in fig. 4, the implementation method for processing the connection vector and the embedded vector of the voice through the deep learning model to obtain the noise vector includes:
S21, performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector.
Optionally, the first cross fusion processing may refer to the encoder combining the features of at least two vectors; the new feature vector generated by the combination may be the fusion vector.
Optionally, the implementation method for performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector includes: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
Optionally, the stacked encoders may include a forward encoder and a backward encoder, where the backward encoder is the first encoder after the forward encoder. For example, the input of the forward encoder is the connection vector and the embedded vector of the voice, and its output is a fusion vector; this fusion vector, together with the connection vector, then forms the input of the backward encoder.
Optionally, the encoder is implemented based on a cross-attention mechanism, and performing first cross fusion processing on the first query vector, the first value vector, and the first key vector through the stacked encoders may refer to performing that processing under the cross-attention mechanism. Under the cross-attention mechanism, the query vector may be a feature vector representing the target or query of interest to the current attention step, the value vector may be a vector containing feature information, and the key vector may be a vector used to measure the similarity between the query vector and the value vector.
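The sketch below shows one encoder layer under this Q/K/V assignment, built on torch.nn.MultiheadAttention purely as an assumed implementation; it further assumes the connection vector and the speech embedding have been projected to a shared (batch, sequence, dim) shape, and the residual connection and layer norm are design assumptions rather than requirements of this embodiment.

```python
import torch
import torch.nn as nn

class CrossFusionEncoderLayer(nn.Module):
    """One stacked-encoder layer: Q1 = K1 = connection vector, V1 = speech embedding."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, connection: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # First cross fusion processing under the cross-attention mechanism
        fused, _ = self.attn(query=connection, key=connection, value=speech_emb)
        return self.norm(connection + fused)  # residual + norm (assumed wiring)
```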
S22, performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder to acquire the noise vector.
Optionally, the deep learning model includes the stacked encoders and the stacked first decoders; the stacked encoders may specifically refer to the stacked encoders in a U2Net model, and the stacked first decoders to the stacked decoders in a U2Net model.
Optionally, the second cross fusion processing may refer to the first decoder combining the features of at least two vectors; the new feature vector generated by the combination may be the noise vector. Since the encoder and the decoder may combine vector features differently, the cross fusion processing of the encoder and that of the first decoder are distinguished as the first and second cross fusion processing.
Optionally, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector through the stacked first decoder includes: performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
Optionally, the operating principle of the forward decoder and the backward decoder in the stacked first decoders may be similar to that of the forward encoder and the backward encoder described above, and is not repeated here.
Optionally, the first decoder is implemented based on a cross-attention mechanism, and performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector may refer to performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector under the cross-attention mechanism.
The designations "first" and "second" merely distinguish different query vectors; the same applies to the first and second value vectors and to the first and second key vectors, which are not described in detail here.
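Putting the two cross fusion processes together, a hypothetical noise predictor could stack the encoder layer sketched above for both roles, since only the Q/K/V assignment differs: the stacked encoders use V1 = speech embedding, the stacked first decoders use V2 = fusion vector. Depth, width, and the wiring between layers are assumptions; the embodiment fixes only the Q/K/V roles.

```python
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Assumed arrangement of stacked encoders (S21) and stacked first decoders (S22)."""
    def __init__(self, dim: int, depth: int = 4, heads: int = 8):
        super().__init__()
        self.encoders = nn.ModuleList(
            [CrossFusionEncoderLayer(dim, heads) for _ in range(depth)])
        self.decoders = nn.ModuleList(
            [CrossFusionEncoderLayer(dim, heads) for _ in range(depth)])

    def forward(self, connection, speech_emb):
        fusion = connection
        for enc in self.encoders:             # S21: Q1 = K1 = connection, V1 = speech
            fusion = enc(fusion, speech_emb)
        noise = connection
        for dec in self.decoders:             # S22: Q2 = K2 = connection, V2 = fusion
            noise = dec(noise, fusion)
        return noise                          # noise vector for the current connection vector
```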
For a clear illustration of the process of the image generation method, refer to fig. 5. Q1, K1, V1 in fig. 5 represent a first query vector, a first key vector, and a first value vector, respectively, and Q2, K2, V2 represent a second query vector, a second key vector, and a second value vector, respectively.
The protection scope of the image generation method according to the embodiments of the present application is not limited to the order of execution of the steps listed herein; any scheme realized by adding, removing, or replacing steps according to the principles of the present application and the prior art falls within the protection scope of the present application.
As shown in fig. 6, the present embodiment provides a voice-based image generation apparatus 600, the image generation apparatus 600 including:
the voice acquisition module 610 is configured to acquire voice for generating a target image.
The speech processing module 620 is configured to process the speech through an audio encoder to obtain an embedded vector of the speech.
The sample acquiring module 630 is configured to acquire an embedded vector of samples, where the samples are randomly sampled from a gaussian distribution.
And the connection processing module 640 is configured to perform connection processing on the embedded vector of the voice and the embedded vector of the sample, so as to obtain a connection vector.
A noise acquisition module 650, configured to process the connection vector and the embedded vector of the speech through a deep learning model to acquire a noise vector and acquire a denoising vector of the connection vector based on the noise vector and the connection vector, where the deep learning model is composed of an encoder and a first decoder based on a cross-attention mechanism.
An image acquisition module 660, configured to process, by a second decoder, the denoising vector of the connection vector to acquire the target image.
In the image generation apparatus 600 provided in this embodiment, the voice acquisition module 610 corresponds to step S11, the voice processing module 620 to step S12, the sample acquisition module 630 to step S13, the connection processing module 640 to step S14, the noise acquisition module 650 to step S15, and the image acquisition module 660 to step S16 of the image generation method shown in fig. 2.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment provides an electronic device comprising a memory storing a computer program, and a processor communicatively connected to the memory that executes the image generation method shown in fig. 2 when the computer program is invoked.
Embodiments of the present application also provide a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium; the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)).
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
When the computer program product is executed by a computer, the computer performs the method of the preceding method embodiments. The computer program product may be a software installation package; when the aforementioned method is needed, the package may be downloaded and executed on a computer.
The description of each process or structure corresponding to the drawings has its own emphasis; for a part of a process or structure that is not described in detail, refer to the descriptions of the other processes or structures.
The foregoing embodiments merely illustrate the principles of the present application and its effectiveness, and are not intended to limit the application. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons of ordinary skill in the art without departing from the spirit and technical teaching disclosed herein shall be covered by the claims of this application.

Claims (10)

1. A voice-based image generation method, the image generation method comprising:
acquiring voice for generating a target image;
processing the voice through an audio encoder to obtain an embedded vector of the voice;
obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector;
processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism;
and processing the denoising vector of the connection vector through a second decoder to acquire the target image.
2. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to acquire a noise vector and acquiring a denoised vector of the connection vector based on the noise vector and the connection vector comprises:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than the preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is the updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
3. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to obtain a noise vector comprises:
performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector;
and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder so as to acquire the noise vector.
4. The image generation method according to claim 3, wherein the implementation method of performing a first cross fusion process on the connection vector and the embedded vector of the speech by the stacked encoders to obtain a fusion vector comprises: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
5. The image generation method according to claim 3, wherein the implementation method of performing a second cross-fusion process on the fusion vector and the connection vector by the stacked first decoder to acquire the noise vector comprises: and performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
6. The image generation method according to claim 1, wherein the denoising vector of the connection vector is a vector obtained by subtracting the noise vector from the connection vector.
7. The image generation method according to claim 1, wherein the second decoder is a decoder in a variational autoencoder.
8. A speech-based image generation apparatus, characterized in that the image generation apparatus comprises:
a voice acquisition module for acquiring voice for generating a target image;
the voice processing module is used for processing the voice through the audio encoder so as to acquire an embedded vector of the voice;
the sample acquisition module is used for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
the connection processing module is used for carrying out connection processing on the embedded vector of the voice and the embedded vector of the sample so as to obtain a connection vector;
a noise acquisition module for processing the connection vector and the embedded vector of the speech through a deep learning model to acquire a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, the deep learning model comprising an encoder and a first decoder based on a cross-attention mechanism;
and the image acquisition module is used for processing the denoising vector of the connection vector through a second decoder so as to acquire the target image.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image generation method of any of claims 1-7.
10. An electronic device, the electronic device comprising:
a memory storing a computer program;
a processor, communicatively connected to the memory, which performs the image generation method of any one of claims 1-7 when the computer program is invoked.
CN202311580355.8A 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment Active CN117292024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311580355.8A CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311580355.8A CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117292024A true CN117292024A (en) 2023-12-26
CN117292024B CN117292024B (en) 2024-04-12

Family

ID=89244727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311580355.8A Active CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117292024B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7168953B1 (en) * 2003-01-27 2007-01-30 Massachusetts Institute Of Technology Trainable videorealistic speech animation
KR20190135853A (en) * 2018-05-29 2019-12-09 한국과학기술원 Method and system of text to multiple speech
US20220358703A1 (en) * 2019-06-21 2022-11-10 Deepbrain Ai Inc. Method and device for generating speech video on basis of machine learning
US20220399025A1 (en) * 2019-06-21 2022-12-15 Deepbrain Ai Inc. Method and device for generating speech video using audio signal
US20220375190A1 (en) * 2020-08-25 2022-11-24 Deepbrain Ai Inc. Device and method for generating speech video
US20220084273A1 (en) * 2020-09-12 2022-03-17 Jingdong Digits Technology Holding Co., Ltd. System and method for synthesizing photo-realistic video of a speech
US20230125839A1 (en) * 2021-10-27 2023-04-27 Samsung Sds Co., Ltd. Method and apparatus for generating synthetic data
CN115643341A (en) * 2022-10-14 2023-01-24 杭州半云科技有限公司 Artificial intelligence customer service response system
CN115758282A (en) * 2022-11-07 2023-03-07 上海蜜度信息技术有限公司 Cross-modal sensitive information identification method, system and terminal
CN115937369A (en) * 2022-11-21 2023-04-07 之江实验室 Expression animation generation method and system, electronic equipment and storage medium
CN116884427A (en) * 2023-05-10 2023-10-13 华中科技大学 Embedded vector processing method based on end-to-end deep learning voice re-etching model
CN116932712A (en) * 2023-06-30 2023-10-24 上海蜜度信息技术有限公司 Multi-mode input interactive information generation method, device, equipment and medium
CN116741197A (en) * 2023-08-11 2023-09-12 上海蜜度信息技术有限公司 Multi-mode image generation method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENG Lixin, YANG Zhen, ZHENG Baoyu: "A New S/U/V Discrimination Method Based on Template Matching of Speech Waveform Images", Journal of Nanjing Institute of Posts and Telecommunications (Natural Science Edition), no. 01, pages 38-42 *

Also Published As

Publication number Publication date
CN117292024B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110321958B (en) Training method of neural network model and video similarity determination method
WO2022017025A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN112633419A (en) Small sample learning method and device, electronic equipment and storage medium
CN110070867B (en) Speech instruction recognition method, computer device and computer-readable storage medium
CN111553477A (en) Image processing method, device and storage medium
CN113239169A (en) Artificial intelligence-based answer generation method, device, equipment and storage medium
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN114065915A (en) Network model construction method, data processing method, device, medium and equipment
CN111859933B (en) Training method, recognition method, device and equipment for maleic language recognition model
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN117292024B (en) Voice-based image generation method and device, medium and electronic equipment
CN111126056A (en) Method and device for identifying trigger words
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114998668A (en) Feature extraction method and device, storage medium and electronic equipment
CN114898734A (en) Pre-training method and device based on speech synthesis model and electronic equipment
CN114155388A (en) Image recognition method and device, computer equipment and storage medium
CN114792086A (en) Information extraction method, device, equipment and medium supporting text cross coverage
CN113222113B (en) Signal generation method and device based on deconvolution layer
CN116341640B (en) Text processing model training method and device
CN110808035B (en) Method and apparatus for training hybrid language recognition models
CN111460812B (en) Sentence emotion classification method and related equipment
CN115062673B (en) Image processing method, image processing device, electronic equipment and storage medium
WO2024114154A1 (en) Noise data determination model training method and apparatus, and noise data determination method and apparatus
CN118038887A (en) Mixed voice processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant