CN117292024A - Voice-based image generation method and device, medium and electronic equipment - Google Patents

Voice-based image generation method and device, medium and electronic equipment

Info

Publication number
CN117292024A
Authority
CN
China
Prior art keywords
vector
connection
voice
embedded
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311580355.8A
Other languages
Chinese (zh)
Other versions
CN117292024B (en)
Inventor
孔欧 (Kong Ou)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mido Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Mido Technology Co ltd filed Critical Shanghai Mido Technology Co ltd
Priority to CN202311580355.8A priority Critical patent/CN117292024B/en
Publication of CN117292024A publication Critical patent/CN117292024A/en
Application granted granted Critical
Publication of CN117292024B publication Critical patent/CN117292024B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/165 - Management of the audio stream, e.g. setting of volume, audio stream path
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G06T 5/50 - Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20084 - Artificial neural networks [ANN]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 - Special algorithmic details
    • G06T 2207/20212 - Image combination
    • G06T 2207/20221 - Image fusion; Image merging
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Processing (AREA)

Abstract

The application provides a voice-based image generation method, together with a corresponding device, medium, and electronic device. The image generation method comprises the following steps: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector through a second decoder to obtain the target image. The image generation method reduces the complexity of the overall image generation process and improves the accuracy of the generated target image.

Description

Voice-based image generation method and device, medium and electronic equipment
Technical Field
The present application relates to voice-based image generation, and in particular to a voice-based image generation method, apparatus, medium, and electronic device.
Background
Speech recognition has developed rapidly in recent years and is now widely used in people's daily life and production, for example for generating text records from speech or recognizing operation instructions.
At present, voice-based image generation is usually realized by cascading tasks, for example an ASR (Automatic Speech Recognition) + Text2Image (Text-to-Image) pipeline. Because such a method requires at least two models, it suffers from high complexity of the overall image generation process and low accuracy of the generated images.
Disclosure of Invention
The aim of the present application is to provide a voice-based image generation method, device, medium, and electronic device to solve the problems of high complexity of the overall image generation process and low accuracy of the generated images in existing image generation methods.
In a first aspect, the present application provides a voice-based image generation method, the image generation method including: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to acquire the target image.
In the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector includes:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than the preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is the updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the speech through the deep learning model to obtain the noise vector includes: performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector; and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder so as to acquire the noise vector.
In an embodiment of the present application, the implementation method for obtaining the fusion vector by performing a first cross fusion process on the connection vector and the embedded vector of the speech through the stacked encoders includes: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
In an embodiment of the present application, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector by using the stacked first decoder includes: performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
In an embodiment of the present application, the denoising vector of the connection vector is a vector obtained by subtracting the noise vector from the connection vector.
In an embodiment of the present application, the second decoder is the decoder of a variational autoencoder.
In a second aspect, the present application provides a voice-based image generation apparatus, comprising: a voice acquisition module for acquiring voice for generating a target image; a voice processing module for processing the voice through an audio encoder to acquire an embedded vector of the voice; a sample acquisition module for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; a connection processing module for performing connection processing on the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; a noise acquisition module for processing the connection vector and the embedded vector of the voice through a deep learning model to acquire a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and an image acquisition module for processing the denoising vector of the connection vector through a second decoder to acquire the target image.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image generation method according to any of the first aspects of the present application.
In a fourth aspect, the present application provides an electronic device, including: a memory storing a computer program; and a processor, communicatively connected to the memory, which executes the image generation method according to any one of the first aspect of the application when the computer program is invoked.
As described above, the voice-based image generation method, device, medium, and electronic device of the present application have the following beneficial effects:
in the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
Drawings
Fig. 1 is a schematic diagram of a hardware structure for running the image generating method according to the embodiment of the present application.
Fig. 2 is a flowchart of an image generating method according to an embodiment of the present application.
Fig. 3 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector according to an embodiment of the present application.
Fig. 4 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector according to an embodiment of the present application.
Fig. 5 is a schematic process diagram of an image generating method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application.
Reference numerals: 10 electronic device; 110 memory; 120 processor; 1210 central processing unit; 1220 neural network processor; 12210 neural network implementation engine; 12220 dedicated hardware circuit; 122210 matrix computation unit; 122220 vector computation unit; 600 voice-based image generation apparatus; 610 voice acquisition module; 620 voice processing module; 630 sample acquisition module; 640 connection processing module; 650 noise acquisition module; 660 image acquisition module; S11-S16 steps; S1-S2 steps; S21-S22 steps.
Detailed Description
Other advantages and effects of the present application will become readily apparent to those skilled in the art from the disclosure in this specification, taken together with the following description of the embodiments and the accompanying drawings. The application may also be implemented or applied through other, different specific embodiments, and the details in this specification may be modified or changed from different viewpoints and for different applications without departing from the spirit of the application. It should be noted that the following embodiments, and the features within them, may be combined with one another provided there is no conflict.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the application in a schematic way; the drawings show only the components related to the application rather than the number, shape, and size of components in an actual implementation. In practice, the form, quantity, and proportions of the components may vary arbitrarily, and their layout may be more complex.
The following describes the technical solutions in the embodiments of the present application in detail with reference to the drawings in the embodiments of the present application.
The image generation method provided by the embodiments of the present application can run on an electronic device. Fig. 1 is a block diagram of the hardware structure of such an electronic device. The electronic device 10 comprises a memory 110 and a processor 120; the processor 120 may be a central processing unit 1210 or a dedicated neural network processor 1220, where the neural network processor 1220 comprises a neural network implementation engine 12210 and dedicated hardware circuits 12220, and the dedicated hardware circuits 12220 comprise a matrix computation unit 122210 and a vector computation unit 122220.
Optionally, the neural network processor 1220 is a processor that performs neural network computation using the dedicated hardware circuit 12220, which is an integrated circuit for neural network computation and includes a matrix computation unit 122210 and a vector computation unit 122220 that perform vector-matrix multiplication in hardware.
Optionally, the neural network implementation engine 12210 is configured to generate instructions for execution by the dedicated hardware circuit 12220, which when executed by the dedicated hardware circuit 12220, cause the dedicated hardware circuit 12220 to perform operations specified by a neural network to generate a neural network output from the received neural network input.
As shown in fig. 2, the present embodiment provides a voice-based image generation method, which may be implemented by a processor of a computer device, the image generation method including:
S11, acquiring voice for generating a target image.
S12, processing the voice through an audio encoder to obtain an embedded vector of the voice.
Optionally, the audio encoder may be a model based on a deep convolutional neural network or a recurrent neural network. The embedded vector of the voice may be a low-dimensional vector that captures the semantic and acoustic features of the audio. The audio encoder may be, for example, wav2vec 2.0, a speech representation learning model based on self-supervised learning that converts speech signals into high-quality speech representations, i.e., encodes waveforms for speech recognition, speech understanding, and other speech-related tasks.
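As a concrete illustration of step S12, the following minimal sketch extracts a speech embedding with wav2vec 2.0 via the HuggingFace transformers library; the library, the checkpoint name "facebook/wav2vec2-base-960h", and the tensor shapes are assumptions for illustration and are not prescribed by this embodiment.

```python
# Hypothetical sketch of step S12: library, checkpoint, and shapes are assumed.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

def speech_embedding(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Map a mono waveform (1-D tensor) to a sequence of speech embeddings."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.squeeze(0)                          # (frames, 768)
```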
S13, obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution.
Optionally, the embedded vector of the sample may refer to a representation of the sample mapped into a vector space. The sample may be drawn at random from a Gaussian distribution.
S14, connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector.
Optionally, the connection processing may refer to joining the embedded vector of the voice and the embedded vector of the sample; specifically, the connection vector may be formed by concatenating the two embedded vectors along a specific dimension.
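A minimal sketch of steps S13 and S14 follows, assuming PyTorch tensors; the shapes and the concatenation axis are illustrative, since the embodiment only states that the two embedded vectors are joined along a specific dimension.

```python
import torch

frames, dim = 50, 768                   # assumed shapes, matching the sketch above
speech_emb = torch.randn(frames, dim)   # stand-in for the audio encoder output
sample_emb = torch.randn(frames, dim)   # S13: random sample from a Gaussian distribution
connection = torch.cat([speech_emb, sample_emb], dim=-1)  # S14: connection vector, (50, 1536)
```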
S15, processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model consists of an encoder and a first decoder based on a cross attention mechanism.
Optionally, the noise vector may refer to a noise vector relative to the connection vector, and the denoising vector may refer to the connection vector with the noise removed. The cross-attention mechanism is a variant of the attention mechanism for handling associations and interactions between multiple input sequences; it is commonly used in natural language processing and computer vision tasks, particularly in sequence-to-sequence models and attention models.
Optionally, the deep learning model may be a U2Net model.
S16, processing the denoising vector of the connection vector through a second decoder to acquire the target image.
Optionally, the first decoder and the second decoder may be different types of decoders. The second decoder may be the decoder of a variational autoencoder, i.e., a VAE decoder.
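As an illustration of step S16, the sketch below decodes a denoised latent with the decoder of a pretrained VAE from the diffusers library; the library, the checkpoint "stabilityai/sd-vae-ft-mse", and the latent shape are assumptions, since the embodiment only requires the decoder of a variational autoencoder.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # assumed checkpoint

def decode_to_image(denoised_latent: torch.Tensor) -> torch.Tensor:
    """Decode a (1, 4, 64, 64) latent, reshaped from the denoising vector, to an image."""
    with torch.no_grad():
        image = vae.decode(denoised_latent).sample  # (1, 3, 512, 512), values in [-1, 1]
    return (image.clamp(-1, 1) + 1) / 2             # rescale to [0, 1] for display
```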
As can be seen from the above description, the image generation method according to the present embodiment includes: acquiring voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to obtain the target image.
In the image generation method, an image can be generated directly from the voice by introducing only a deep learning model, without cascading tasks; the complexity of the overall image generation process is therefore low, and because no task cascading is involved, the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
As shown in fig. 3, the present embodiment provides an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, including:
s1: and processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector.
Optionally, the denoising vector of the connection vector is the vector obtained by subtracting the noise vector from the connection vector.
S2: if the current iteration number is smaller than the preset iteration number, turning to S1, adding 1 to the value of the current iteration number, wherein the connection vector in S1 is an updated connection vector, the updated connection vector is a denoising vector of the connection vector, and otherwise, acquiring the denoising vector of the connection vector.
Optionally, the preset iteration number may be set flexibly according to the actual situation; for example, it may be set to 500. The initial value of the current iteration number may be 1.
Optionally, the noise vector is a noise vector relative to the connection vector. When going from S2 back to S1, the connection vector in S1 is the updated connection vector, the noise vector in S1 is the noise vector relative to the updated connection vector, and the denoising vector in S1 may refer to the vector obtained by subtracting the noise vector from the updated connection vector. For example, if the connection vector is u and the noise vector is v, the denoising vector of the connection vector is u-v.
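In code, the S1/S2 loop might look as follows; this is a sketch following the u-v example above, with `model` standing in for the deep learning model described below and 500 as the preset iteration number.

```python
def iterative_denoise(model, connection, speech_emb, num_iters=500):
    """Sketch of S1/S2: repeatedly predict and subtract noise from the connection vector."""
    u = connection
    for _ in range(num_iters):       # S2: iterate until the preset count is reached
        v = model(u, speech_emb)     # S1: noise vector for the current (updated) u
        u = u - v                    # denoising vector u - v becomes the new u
    return u                         # denoising vector with the noise removed
```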
As can be seen from the above description, the implementation method of processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector according to the present embodiment includes: S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector; S2: if the current iteration number is smaller than the preset iteration number, returning to S1, where the connection vector in S1 is the updated connection vector, i.e., the denoising vector of the connection vector; otherwise, outputting the denoising vector of the connection vector.
In this method, by setting a preset iteration number and repeatedly executing S1 and S2, a connection vector with the noise completely removed is obtained, which improves the accuracy of the finally generated target image.
As shown in fig. 4, the implementation method for processing the connection vector and the embedded vector of the voice through the deep learning model to obtain the noise vector includes:
S21, performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector.
Optionally, the first cross fusion processing may refer to the encoder combining the features of at least two vectors; the new feature vector generated by the combination may be the fusion vector.
Optionally, the implementation method for performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector includes: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
Optionally, the stacked encoders may include a forward encoder and a backward encoder, where the backward encoder is the first encoder after the forward encoder. For example, the input of the forward encoder is the connection vector and the embedded vector of the voice, and its output is a fusion vector; this fusion vector, together with the connection vector, then forms the input of the backward encoder.
Optionally, the encoder is implemented based on a cross-attention mechanism, and performing first cross fusion processing on the first query vector, the first value vector, and the first key vector through the stacked encoders may refer to performing that processing under the cross-attention mechanism. Under the cross-attention mechanism, the query vector may be a feature vector representing the target or query of interest to the current attention step, the value vector may be a vector containing feature information, and the key vector may be a vector used to measure the similarity between the query vector and the value vector.
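The sketch below shows one encoder layer under this Q/K/V assignment, built on torch.nn.MultiheadAttention purely as an assumed implementation; it further assumes the connection vector and the speech embedding have been projected to a shared (batch, sequence, dim) shape, and the residual connection and layer norm are design assumptions rather than requirements of this embodiment.

```python
import torch
import torch.nn as nn

class CrossFusionEncoderLayer(nn.Module):
    """One stacked-encoder layer: Q1 = K1 = connection vector, V1 = speech embedding."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, connection: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # First cross fusion processing under the cross-attention mechanism
        fused, _ = self.attn(query=connection, key=connection, value=speech_emb)
        return self.norm(connection + fused)  # residual + norm (assumed wiring)
```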
S22, performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder to acquire the noise vector.
Optionally, the deep learning model includes the stacked encoders and the stacked first decoders; the stacked encoders may specifically refer to the stacked encoders in a U2Net model, and the stacked first decoders to the stacked decoders in a U2Net model.
Optionally, the second cross fusion processing may refer to the first decoder combining the features of at least two vectors; the new feature vector generated by the combination may be the noise vector. Since the encoder and the decoder may combine vector features differently, the cross fusion processing of the encoder and that of the first decoder are distinguished as the first and second cross fusion processing.
Optionally, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector through the stacked first decoder includes: performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
Optionally, the operating principle of the forward decoder and the backward decoder in the stacked first decoders may be similar to that of the forward encoder and the backward encoder described above, and is not repeated here.
Optionally, the first decoder is implemented based on a cross-attention mechanism, and performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector may refer to performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector under the cross-attention mechanism.
The designations "first" and "second" merely distinguish different query vectors; the same applies to the first and second value vectors and to the first and second key vectors, which are not described in detail here.
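Putting the two cross fusion processes together, a hypothetical noise predictor could stack the encoder layer sketched above for both roles, since only the Q/K/V assignment differs: the stacked encoders use V1 = speech embedding, the stacked first decoders use V2 = fusion vector. Depth, width, and the wiring between layers are assumptions; the embodiment fixes only the Q/K/V roles.

```python
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Assumed arrangement of stacked encoders (S21) and stacked first decoders (S22)."""
    def __init__(self, dim: int, depth: int = 4, heads: int = 8):
        super().__init__()
        self.encoders = nn.ModuleList(
            [CrossFusionEncoderLayer(dim, heads) for _ in range(depth)])
        self.decoders = nn.ModuleList(
            [CrossFusionEncoderLayer(dim, heads) for _ in range(depth)])

    def forward(self, connection, speech_emb):
        fusion = connection
        for enc in self.encoders:             # S21: Q1 = K1 = connection, V1 = speech
            fusion = enc(fusion, speech_emb)
        noise = connection
        for dec in self.decoders:             # S22: Q2 = K2 = connection, V2 = fusion
            noise = dec(noise, fusion)
        return noise                          # noise vector for the current connection vector
```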
For a clear illustration of the process of the image generation method, refer to fig. 5. Q1, K1, V1 in fig. 5 represent a first query vector, a first key vector, and a first value vector, respectively, and Q2, K2, V2 represent a second query vector, a second key vector, and a second value vector, respectively.
The protection scope of the image generation method according to the embodiments of the present application is not limited to the order of execution of the steps listed herein; any scheme realized by adding, removing, or replacing steps according to the principles of the present application and the prior art falls within the protection scope of the present application.
As shown in fig. 6, the present embodiment provides a voice-based image generation apparatus 600, the image generation apparatus 600 including:
the voice acquisition module 610 is configured to acquire voice for generating a target image.
The speech processing module 620 is configured to process the speech through an audio encoder to obtain an embedded vector of the speech.
The sample acquiring module 630 is configured to acquire an embedded vector of samples, where the samples are randomly sampled from a gaussian distribution.
And the connection processing module 640 is configured to perform connection processing on the embedded vector of the voice and the embedded vector of the sample, so as to obtain a connection vector.
A noise acquisition module 650, configured to process the connection vector and the embedded vector of the speech through a deep learning model to acquire a noise vector and acquire a denoising vector of the connection vector based on the noise vector and the connection vector, where the deep learning model is composed of an encoder and a first decoder based on a cross-attention mechanism.
An image acquisition module 660, configured to process, by a second decoder, the denoising vector of the connection vector to acquire the target image.
In the image generation apparatus 600 provided in this embodiment, the voice acquisition module 610 corresponds to step S11, the voice processing module 620 to step S12, the sample acquisition module 630 to step S13, the connection processing module 640 to step S14, the noise acquisition module 650 to step S15, and the image acquisition module 660 to step S16 of the image generation method shown in fig. 2.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment provides an electronic device comprising a memory storing a computer program, and a processor communicatively connected to the memory that executes the image generation method shown in fig. 2 when the computer program is invoked.
Embodiments of the present application also provide a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium; the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)).
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
When the computer program product is executed by a computer, the computer performs the method of the preceding method embodiments. The computer program product may be a software installation package; when the aforementioned method is needed, the package may be downloaded and executed on a computer.
The description of each process or structure corresponding to the drawings has its own emphasis; for a part of a process or structure that is not described in detail, refer to the descriptions of the other processes or structures.
The foregoing embodiments merely illustrate the principles of the present application and its effectiveness, and are not intended to limit the application. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons of ordinary skill in the art without departing from the spirit and technical teaching disclosed herein shall be covered by the claims of this application.

Claims (10)

1. A voice-based image generation method, the image generation method comprising:
acquiring voice for generating a target image;
processing the voice through an audio encoder to obtain an embedded vector of the voice;
obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector;
processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism;
and processing the denoising vector of the connection vector through a second decoder to acquire the target image.
2. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to acquire a noise vector and acquiring a denoised vector of the connection vector based on the noise vector and the connection vector comprises:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than the preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is the updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
3. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to obtain a noise vector comprises:
performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector;
and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder so as to acquire the noise vector.
4. The image generation method according to claim 3, wherein the implementation method of performing a first cross fusion process on the connection vector and the embedded vector of the speech by the stacked encoders to obtain a fusion vector comprises: performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both the connection vector, and the first value vector is the embedded vector of the voice.
5. The image generation method according to claim 3, wherein the implementation method of performing a second cross-fusion process on the fusion vector and the connection vector by the stacked first decoder to acquire the noise vector comprises: and performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
6. The image generation method according to claim 1, wherein the denoising vector of the connection vector is a vector obtained by subtracting the noise vector from the connection vector.
7. The image generation method according to claim 1, wherein the second decoder is a decoder in a variational autoencoder.
8. A speech-based image generation apparatus, characterized in that the image generation apparatus comprises:
a voice acquisition module for acquiring voice for generating a target image;
the voice processing module is used for processing the voice through the audio encoder so as to acquire an embedded vector of the voice;
the sample acquisition module is used for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
the connection processing module is used for carrying out connection processing on the embedded vector of the voice and the embedded vector of the sample so as to obtain a connection vector;
a noise acquisition module for processing the connection vector and the embedded vector of the speech through a deep learning model to acquire a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, the deep learning model comprising an encoder and a first decoder based on a cross-attention mechanism;
and the image acquisition module is used for processing the denoising vector of the connection vector through a second decoder so as to acquire the target image.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image generation method of any of claims 1-7.
10. An electronic device, the electronic device comprising:
a memory storing a computer program;
a processor, communicatively connected to the memory, which performs the image generation method of any one of claims 1-7 when the computer program is invoked.
CN202311580355.8A 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment Active CN117292024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311580355.8A CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311580355.8A CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN117292024A true CN117292024A (en) 2023-12-26
CN117292024B CN117292024B (en) 2024-04-12

Family

ID=89244727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311580355.8A Active CN117292024B (en) 2023-11-24 2023-11-24 Voice-based image generation method and device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117292024B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7168953B1 (en) * 2003-01-27 2007-01-30 Massachusetts Institute Of Technology Trainable videorealistic speech animation
KR20190135853A (en) * 2018-05-29 2019-12-09 한국과학기술원 Method and system of text to multiple speech
US20220358703A1 (en) * 2019-06-21 2022-11-10 Deepbrain Ai Inc. Method and device for generating speech video on basis of machine learning
US20220399025A1 (en) * 2019-06-21 2022-12-15 Deepbrain Ai Inc. Method and device for generating speech video using audio signal
US20220375190A1 (en) * 2020-08-25 2022-11-24 Deepbrain Ai Inc. Device and method for generating speech video
US20220084273A1 (en) * 2020-09-12 2022-03-17 Jingdong Digits Technology Holding Co., Ltd. System and method for synthesizing photo-realistic video of a speech
US20230125839A1 (en) * 2021-10-27 2023-04-27 Samsung Sds Co., Ltd. Method and apparatus for generating synthetic data
CN115643341A (en) * 2022-10-14 2023-01-24 杭州半云科技有限公司 Artificial intelligence customer service response system
CN115758282A (en) * 2022-11-07 2023-03-07 上海蜜度信息技术有限公司 Cross-modal sensitive information identification method, system and terminal
CN115937369A (en) * 2022-11-21 2023-04-07 之江实验室 Expression animation generation method and system, electronic equipment and storage medium
CN116884427A (en) * 2023-05-10 2023-10-13 华中科技大学 Embedded vector processing method based on end-to-end deep learning voice re-etching model
CN116932712A (en) * 2023-06-30 2023-10-24 上海蜜度信息技术有限公司 Multi-mode input interactive information generation method, device, equipment and medium
CN116741197A (en) * 2023-08-11 2023-09-12 上海蜜度信息技术有限公司 Multi-mode image generation method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENG Lixin, YANG Zhen, ZHENG Baoyu: "A New S/U/V Discrimination Method Based on Template Matching of Speech Waveform Images", Journal of Nanjing Institute of Posts and Telecommunications (Natural Science Edition), no. 01, pages 38-42 *

Also Published As

Publication number Publication date
CN117292024B (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110321958B (en) Training method of neural network model and video similarity determination method
WO2022017025A1 (en) Image processing method and apparatus, storage medium, and electronic device
CN112633419A (en) Small sample learning method and device, electronic equipment and storage medium
CN110070867B (en) Speech instruction recognition method, computer device and computer-readable storage medium
CN111553477A (en) Image processing method, device and storage medium
CN113239169A (en) Artificial intelligence-based answer generation method, device, equipment and storage medium
CN113468344B (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN114627868A (en) Intention recognition method and device, model and electronic equipment
CN114065915A (en) Network model construction method, data processing method, device, medium and equipment
CN111859933B (en) Training method, recognition method, device and equipment for maleic language recognition model
CN110675865B (en) Method and apparatus for training hybrid language recognition models
CN117292024B (en) Voice-based image generation method and device, medium and electronic equipment
CN111126056A (en) Method and device for identifying trigger words
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
CN114998668A (en) Feature extraction method and device, storage medium and electronic equipment
CN114898734A (en) Pre-training method and device based on speech synthesis model and electronic equipment
CN114155388A (en) Image recognition method and device, computer equipment and storage medium
CN114792086A (en) Information extraction method, device, equipment and medium supporting text cross coverage
CN113222113B (en) Signal generation method and device based on deconvolution layer
CN116341640B (en) Text processing model training method and device
CN110808035B (en) Method and apparatus for training hybrid language recognition models
CN111460812B (en) Sentence emotion classification method and related equipment
CN115062673B (en) Image processing method, image processing device, electronic equipment and storage medium
WO2024114154A1 (en) Noise data determination model training method and apparatus, and noise data determination method and apparatus
CN118038887A (en) Mixed voice processing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant